# Logistic regression¶

```
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
%matplotlib inline
import matplotlib.pyplot as plt
# Make the plots look more fancy.
plt.style.use('fivethirtyeight')
# Optimization function
from scipy.optimize import minimize
```

By Peter Rush and Matthew Brett, with considerable inspiration from the logistic regression section of Allen Downey’s book Think Stats, second edition.

In this section we will look at another regression technique: logistic regression.

We use logistic regression when we want to predict a *binary categorical*
outcome variable (or column) from one or more predicting variables (or
columns).

A binary categorical variable is one where an observation can fall into one of only two categories. We give each observation a label corresponding to its category. Some examples are:

* Did a patient die or did they survive through 6 months of treatment? The patient can be in only one of the two categories. In some column of our data table, patients that died might have the label “died”, and those who survived have the label “survived”.
* Did a person experience more than one episode of psychosis in the last 5 years (“yes” or “no”)?
* Did a person with a conviction for one offense offend again (“yes” or “no”)?

For this tutorial, we return to the chronic kidney disease dataset.

Each row in this dataset represents one patient.

For each patient, the doctors recorded whether or not the patient had chronic
kidney disease. This is a *binary categorical variable*; you can see the
values in the “Class” column. A value of 1 means the patient *did* have CKD; a
value of 0 means they *did not*. In this case we are labeling the categories
with numbers (1 / 0).

Many of the rest of the columns are measurements from blood tests and urine tests.

```
df = pd.read_csv('ckd_clean.csv')
df.head()
```

|   | Age | Blood Pressure | Specific Gravity | Albumin | Sugar | Red Blood Cells | Pus Cell | Pus Cell clumps | Bacteria | Blood Glucose Random | ... | Packed Cell Volume | White Blood Cell Count | Red Blood Cell Count | Hypertension | Diabetes Mellitus | Coronary Artery Disease | Appetite | Pedal Edema | Anemia | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | 1 |
| 1 | 53.0 | 90.0 | 1.020 | 2.0 | 0.0 | abnormal | abnormal | present | notpresent | 70.0 | ... | 29.0 | 12100.0 | 3.7 | yes | yes | no | poor | no | yes | 1 |
| 2 | 63.0 | 70.0 | 1.010 | 3.0 | 0.0 | abnormal | abnormal | present | notpresent | 380.0 | ... | 32.0 | 4500.0 | 3.8 | yes | yes | no | poor | yes | no | 1 |
| 3 | 68.0 | 80.0 | 1.010 | 3.0 | 2.0 | normal | abnormal | present | present | 157.0 | ... | 16.0 | 11000.0 | 2.6 | yes | yes | yes | poor | yes | no | 1 |
| 4 | 61.0 | 80.0 | 1.015 | 2.0 | 0.0 | abnormal | abnormal | notpresent | notpresent | 173.0 | ... | 24.0 | 9200.0 | 3.2 | yes | yes | yes | poor | yes | yes | 1 |

5 rows × 25 columns

There are actually a large number of binary categorical variables in this dataset. For example, the “Hypertension” column has labels for the two categories “yes” (the patient had persistently high blood pressure) or “no”.

The categorical variable we are interested in here is “Appetite”. This has the label “good” for patients with good appetite, and “poor” for those with poor appetite. Poor appetite is a symptom of chronic kidney disease. In our case, we wonder whether the extent of kidney damage does a convincing job in predicting whether the patient has a “good” appetite.

As you remember, the CKD dataset has a column “Hemoglobin” that has the
concentration of hemoglobin from a blood sample. Hemoglobin is the molecule
that carries oxygen in the blood; it is the molecule that makes the red blood
cells red. Damaged kidneys produce lower concentrations of the hormone that
stimulates red blood cell production,
erythropoietin, so CKD
patients often have fewer red blood cells, and lower concentrations of
Hemoglobin. We will take lower “Hemoglobin” as an index of kidney damage.
Therefore, we predict that patients with lower “Hemoglobin” values are more
likely to have `poor` “Appetite” values, and, conversely, patients with higher
“Hemoglobin” values are more likely to have `good` “Appetite” values.

First we make a new data frame that just has the two columns we are interested in:

```
hgb_app = df.loc[:, ['Hemoglobin', 'Appetite']].copy()
hgb_app.head()
```

|   | Hemoglobin | Appetite |
|---|---|---|
| 0 | 11.2 | poor |
| 1 | 9.5 | poor |
| 2 | 10.8 | poor |
| 3 | 5.6 | poor |
| 4 | 7.7 | poor |

## Dummy Variables¶

We will soon find ourselves wanting to do calculations on the values in the
“Appetite” column, and we cannot easily do that with the current string values
of “good” and “poor”. Our next step is to recode the string values to
numbers, ready for our calculations. We use 1 to represent “good” and 0 to
represent “poor”. This kind of recoding, where we replace category labels
with 1 and 0 values, is often called *dummy coding*.

```
# 'replace' replaces the values in the first argument with the corresponding
# values in the second argument.
hgb_app['appetite_dummy'] = hgb_app['Appetite'].replace(
    ['poor', 'good'],
    [0, 1])
hgb_app.head()
```

|   | Hemoglobin | Appetite | appetite_dummy |
|---|---|---|---|
| 0 | 11.2 | poor | 0 |
| 1 | 9.5 | poor | 0 |
| 2 | 10.8 | poor | 0 |
| 3 | 5.6 | poor | 0 |
| 4 | 7.7 | poor | 0 |

*Note*: When you are doing this, be sure to keep track of which label you have
coded as 1. Normally this would be the more interesting outcome. In this
case, “good” has the code 1. Keep track of the label corresponding to 1, as it
will affect the interpretation of the regression coefficients.

Now we have the dummy (1 or 0) variable, let us use a scatter plot to look at the relationship between hemoglobin concentration and whether a patient has a good appetite.
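The original plotting cell is not shown in this copy; here is a minimal sketch of the kind of plot we have in mind, using the data frame above.

```
# Sketch of the scatter plot (one plausible version; the original plotting
# code is not reproduced here).
plt.scatter(hgb_app['Hemoglobin'], hgb_app['appetite_dummy'])
plt.xlabel('Hemoglobin')
plt.ylabel('Appetite dummy (0 = poor, 1 = good)');
```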

From the plot, it does look as if the patients with lower hemoglobin are more
likely to have poor appetite (`appetite_dummy` values of 0), whereas patients
with higher hemoglobin tend to have good appetite (`appetite_dummy` values of
1).

Now we start to get more formal, and develop a model with which we predict the
`appetite_dummy` values from the `Hemoglobin` values.

## How about linear regression?¶

Remember that, in linear regression, we predict scores on the *outcome*
variable (or column) using a straight-line relationship of the *predictor*
variable (or column).

Here are the predictor and outcome variables in our case.

```
# The x (predictor) and y (outcome) variables.
hemoglobin = hgb_app['Hemoglobin']
appetite_d = hgb_app['appetite_dummy']
```

Why not use the same linear regression technique for our case? After all, the
`appetite_d` values are just numbers (0 and 1), as are our `hemoglobin` values.

Earlier in the textbook, we performed linear regression by using `minimize` to
find the value of the slope and intercept of the line which gives the smallest
sum of the squared prediction errors.

Recall that, in linear regression, the predictions come from the straight-line
relationship `predicted = intercept + slope * predictor`. The *predicted* and
*predictor* variables here are sequences of values, with one value for each
observation (row) in the dataset. In our case we have:

```
print('Rows in CKD data', len(hgb_app))
```

```
Rows in CKD data 158
```

That is, we have 158 “observations” (patients), so there will be the same
number of scores on the predictor variable (`hemoglobin`), and the same number
of predictions, in `predicted`. By contrast, the slope and intercept are
single values, defining the line.

We used `minimize` to find the values of the slope and intercept which give the
“best” predictions. So far, we have almost invariably defined the *best*
values for slope and intercept as the values that give the smallest sum of the
squared prediction errors.

What would happen if we tried to use linear regression to predict the probability of having good appetite, based on hemoglobin concentrations?

Let us start by grabbing a version of the `ss_any_line` function from the
Using minimize page.

```
def ss_any_line(c_s, x_values, y_values):
    # c_s is a list containing two elements, an intercept and a slope.
    intercept, slope = c_s
    # Values predicted from these x_values, using this intercept and slope.
    predicted = intercept + x_values * slope
    # Difference of prediction from the actual y values.
    error = y_values - predicted
    # Sum of squared error.
    return np.sum(error ** 2)
```

The sum of squared prediction error, in linear regression, is our *cost*
function. When we have a good pair of (intercept, slope) in `c_s`, our function
is *cheap* - i.e. the returned value is small. When we have a bad pair in
`c_s`, our function is *expensive* - the returned value is large.

If the value from `ss_any_line` is large, it means the line we are fitting does
not fit the data well. The purpose of linear regression is to find the
line which leads to the smallest cost. In our case, the cost is the sum of the
squared prediction errors.

Let’s use linear regression on the current example. From looking at our plot above, we start with a guess of -1 for the intercept, and 0.1 for the slope.

```
# Use minimize to find the least sum of squares solution.
min_lin_reg = minimize(ss_any_line, [-1, 0.1], args=(hemoglobin, appetite_d))
# Show the results that came back from minimize.
min_lin_reg
```

```
fun: 10.315509522392956
hess_inv: array([[ 0.07498683, -0.00524735],
[-0.00524735, 0.00038337]])
jac: array([-1.87158585e-05, 5.72204590e-05])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 138
nit: 2
njev: 42
status: 2
success: False
x: array([-0.0790431 , 0.07004938])
```

OK, so that looks hopeful. Using linear regression with `minimize` we found that the sum of squared prediction errors was smallest for a line with:

```
# Unpack the slope and intercept estimates from the results object.
lin_reg_intercept, lin_reg_slope = min_lin_reg.x
# Show them.
print('Best linear regression intercept', lin_reg_intercept)
print('Best linear regression slope', lin_reg_slope)
```

```
Best linear regression intercept -0.07904310269972381
Best linear regression slope 0.07004938485490964
```

The linear regression model we are using here is:

```
predicted_appetite_d = intercept + slope * hemoglobin
```

Specifically:

```
predicted_lin_reg = lin_reg_intercept + lin_reg_slope * hemoglobin
predicted_lin_reg
```

```
0 0.705510
1 0.586426
2 0.677490
3 0.313233
4 0.460337
...
153 1.020732
154 1.076772
155 1.027737
156 0.915658
157 1.027737
Name: Hemoglobin, Length: 158, dtype: float64
```

Let’s plot our predictions, alongside the actual data. We plot the predictions from linear regression in orange.
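The plotting cell is not shown in this copy; a minimal sketch, using the variables defined above, might look like this.

```
# Sketch of the plot of the data plus the linear regression predictions
# (the original plotting cell is not reproduced here).
plt.scatter(hemoglobin, appetite_d, label='Actual values')
plt.scatter(hemoglobin, predicted_lin_reg, color='orange',
            label='Linear regression predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Appetite dummy / prediction')
plt.legend();
```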

The linear regression line looks plausible, as far as it goes, but it has
several unhappy features for our task of predicting the `appetite_d` 0 / 1
values.

One thing to like about the line is that the predictions are right to suggest
that the value of `appetite_d` is more likely to be 1 (meaning “good”) at
higher values of `hemoglobin`. Also, the prediction line slopes upward as
`hemoglobin` gets higher, indicating that the probability of good appetite gets
higher as the hemoglobin concentration rises, across patients.

However, when the `hemoglobin` gets higher than about 15.5, linear regression
starts to predict a value for `appetite_d` that is greater than 1 - which, of
course, cannot occur in the `appetite_d` values, which are restricted to 0 or
1.

Looking at the plot, without the regression line, it looks as if we can be fairly confident of predicting a 1 (“good”) value for a hemoglobin above 12.5, but we are increasingly less confident about predicting a 1 value as hemoglobin drops down to about 7.5, at which point we become confident about predicting a 0 value.

These reflections make us wonder whether we should be using something other than a simple, unconstrained straight line for our predictions.

## Another prediction line¶

Here’s another prediction line we might use for `appetite_d`, with the
predicted values.

For the moment, please don’t worry about how we came by this line, we will come onto that soon.

The new prediction line is in gold.

```
# This is the machinery for making the sigmoid line of the plots below. We
# will come on to that machinery soon. For now please ignore this code, and
# concentrate on the plots below.
def inv_logit(y):
    """ Reverse logit transformation
    """
    odds_ratios = np.exp(y)  # Reverse the log operation.
    return odds_ratios / (odds_ratios + 1)  # Reverse odds ratios operation.


def params2pps(intercept, slope, x):
    """ Calculate predicted probabilities of 1 for each observation.
    """
    # Predicted log odds of being in class 1.
    predicted_log_odds = intercept + slope * x
    return inv_logit(predicted_log_odds)


# Some plausible values for intercept and slope.
nice_intercept, nice_slope = -7, 0.8
predictions_new = params2pps(nice_intercept, nice_slope, hemoglobin)
```
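The plot of this new line is not reproduced here; a sketch of how it might be drawn, using the `predictions_new` values from the cell above:

```
# Sketch of the plot showing the data and the new sigmoid predictions in gold
# (the original plotting cell is not reproduced here).
plt.scatter(hemoglobin, appetite_d, label='Actual values')
plt.scatter(hemoglobin, predictions_new, color='gold',
            label='Sigmoid predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Appetite dummy / prediction')
plt.legend();
```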

The new not-straight line seems to have much to recommend it. This shape of
line is called “sigmoid”, from “sigma”, the name of the Greek letter “s”. The
sigmoid prediction here never goes above 1 or below 0, so its values are always
in the range of the `appetite_d` data it is trying to predict. It climbs
steeply to a prediction of 1, and plateaus there, as we get to the threshold of
hemoglobin around 12.5, at which every patient does seem to have “good”
appetite (`appetite_d` of 1).

We can think of the values from the sigmoid curve as being *predicted
probabilities*. For example, at a `hemoglobin` value of 10, the curve gives a
predicted y (`appetite_d`) value of about 0.73. We can interpret this
prediction as saying that, with a hemoglobin value of 10, there is a
*probability* of about 0.73 that the corresponding patient will have a “good”
appetite (`appetite_d` value of 1).

Now let us say that we would prefer to use this kind of sigmoid line to predict
`appetite_d`. So far, we have only asked `minimize` to predict directly from a
straight line - for example, in the `ss_any_line` function. How can we get
`minimize` to predict from a family of sigmoid curves, as here? Is there a way
of transforming a sigmoid curve like the one here, with y values from 0 to 1,
to a straight line, where the y values can vary from large negative to large
positive? We would like to do such a conversion, so we have a slope and
intercept that `minimize` can work with easily.

The answer, you can imagine, is “yes” — we can go from our sigmoid 0 / 1 curve to a straight line with unconstrained y values, in two fairly simple steps. The next sections will cover those steps. The two steps are:

1. Convert the 0 / 1 *probability* predictions to 0-to-large-positive predictions of the *odds ratio*. The odds ratio can vary from 0 to very large positive values.
2. Apply the *logarithm* function to convert the 0-to-very-large-positive odds-ratio predictions to log odds-ratio predictions, which can vary from very large negative to very large positive.

These two transformations together are called the *log-odds* or
*logit* transformation. *Logistic regression* is regression using the *logit*
transform. Applying the logit transform converts the sigmoid curve to a
straight line.

We will explain more about the two stages of the transform below, but for now, here are the two stages in action.

This is the original sigmoid curve above, with the predictions, in its own plot.

Next we apply the conversion from probability to odds-ratio:
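The cell that builds `predictions_or`, and the fine-grained line values `fine_x` and `fine_y_or` used in the plotting code below, is not shown in this copy. Here is a sketch of how those values might be computed from the sigmoid predictions above; the variable names match the later plotting code, but the exact construction is an assumption.

```
# Sketch (assumed reconstruction, not the original hidden cell).
# Convert the sigmoid probability predictions to odds ratios.
predictions_or = predictions_new / (1 - predictions_new)
# A fine grid of hemoglobin values, for drawing smooth lines.
fine_x = np.linspace(np.min(hemoglobin), np.max(hemoglobin), 1000)
# Probability and odds-ratio predictions on the fine grid.
fine_y = params2pps(nice_intercept, nice_slope, fine_x)
fine_y_or = fine_y / (1 - fine_y)
# Plot the odds-ratio predictions.
plt.scatter(hemoglobin, predictions_or, color='gold')
plt.plot(fine_x, fine_y_or, linewidth=1, linestyle=':')
plt.title('Odds-ratio predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Odds ratio');
```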

Notice that this is an *exponential* graph, where the y values increase more
and more steeply as the x values increase. We can turn exponential lines like
this one into straight lines, using the *logarithm* function, the next stage of
the logit transformation:

```
predictions_or_log = np.log(predictions_or)
plt.scatter(hemoglobin, predictions_or_log, color='gold')
fine_y_or_log = np.log(fine_y_or)
plt.plot(fine_x, fine_y_or_log, linewidth=1, linestyle=':')
plt.title('Logit prediction')
plt.xlabel('Hemoglobin')
plt.ylabel('Log odds-ratio');
```

Now we have a straight line, with an intercept and slope, suitable for
`minimize`. The next few sections go into more detail on the odds-ratio and
logarithm steps.

## Probability and Odds¶

For logistic regression, in contrast to linear regression, we are interested in
predicting the *probability of an observation falling into a particular outcome
class* (0 or 1).

In this case, we are interested in the probability of a patient having “good” appetite, predicted from the patient’s hemoglobin.

We can think of probability as the *proportion of times* we expect to see a
particular outcome.

For example, there are 139 patients with “good” appetite in this data frame,
and 158 patients in total. If you were to repeatedly draw a single patient at
random from the data frame, and record their `Appetite`, then we expect the
proportion of “good” values, in the long run, to be 139 / 158 - about 0.88.
That is the same as saying there is a probability of 0.88 that a
randomly-drawn patient has a “good” appetite.

Because the patient’s appetite can only be “good” or “poor”, and because the probabilities of all possible options have to add up to 1, the probability of the patient having a “poor” appetite is 1 - 0.88 — about 0.12.
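As a quick check, we can count the labels directly (a small sketch using the data frame above):

```
# Check the proportions directly from the data frame.
n_good = np.count_nonzero(hgb_app['Appetite'] == 'good')
n_total = len(hgb_app)
print('Number of patients with good appetite:', n_good)
print('Probability of good appetite:', n_good / n_total)
print('Probability of poor appetite:', 1 - n_good / n_total)
```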

So, the probability can express the *proportion of times* we expect to see some
*event of interest* - in our case, the event of interest is “good” in the
`Appetite` column.

We can think of this same information as an *odds ratio*.

We often express probabilities as odds ratios. For example, we might say that
the odds are 5 to 1 that it will rain today. We mean that it is five times
more likely that it will rain today than that it will not rain today. On
another day we could estimate that the odds are 0.5 to 1 that it will rain
today, meaning that it is twice as likely *not* to rain, as it is to rain.

The odds ratio is the number of times we expect to see the event of interest (e.g. rain) for every time we expect to see the event of no interest (not rain).

This is just a way of saying a probability in a different way, and we can convert easily between probabilities and odds ratios.

To convert from a probability to an odds ratio, we remember that the odds ratio is the number of times we expect to see the event of interest for every time we expect to see the event of no interest. This is the probability (proportion) for the event of interest, divided by the probability (proportion) of the event of no interest. Say the probability p is some value (it could be any value):

```
# Our proportion of interest.
p = 0.88
```

Then the equivalent odds ratio `odds_ratio` is:

```
# Odds ratio is proportion of interest, divided by proportion of no interest.
odds_ratio = p / (1 - p)
odds_ratio
```

```
7.333333333333334
```

Notice the odds ratio is greater than one, because `p` was greater than 0.5. An
odds ratio of greater than one means the event of interest is more likely to
happen than the alternative. An odds ratio of less than one means `p` was less
than 0.5, and the event of interest is less likely to happen than the
alternative. A `p` value of 0 gives an odds ratio of 0. As the `p` value gets
close to 1, the odds ratio gets very large.

```
p = 0.999999
p / (1 - p)
```

```
999998.9999712444
```

We can also convert from odds ratios to p values. Remember the odds ratio is
the number of times we expect to see the event of interest, divided by the
number of times we expect to see the event of no interest. The probability is
the proportion of all events that are events of interest. We can read an
`odds_ratio` of - say - 2.5 as “the chances are 2.5 to 1”. To get the
probability we divide the number of times we expect to see the event of
interest - here 2.5 - by the number of events in total. The number of events
in total is just the number of events of interest - here 2.5 - plus the number
of events of no interest - here 1. So:

```
p_from_odds_ratio = odds_ratio / (odds_ratio + 1)
p_from_odds_ratio
```

```
0.88
```

Summary - convert probabilities to odds ratios with:

```
odds_ratio = p / (1 - p)
```

Convert odds ratios to probabilities with:

```
p = odds_ratio / (odds_ratio + 1)
```

As you’ve seen, when we apply the conversion above to the probability values of our appetite predictions, we get the following:

Notice that the odds ratios we found vary from very close to 0 (for p value predictions very close to 0) to very large (for p values very close to 1).

Notice too that our graph looks exponential, and we want it to be a straight
line. Our next step is to apply a *logarithm* transformation.

## The logarithm transform¶

See the logarithm refresher page for more background on logarithms.

For now, the only thing you need to know about logarithms is that they are transformations that convert an exponential curve into a straight line.

Here’s an example:
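The original example plot is not reproduced in this copy; the sketch below makes the same point with made-up exponential values - an exponential curve becomes a straight line after taking logs.

```
# Sketch of an example (not the original plot): taking the log of
# exponentially increasing values gives a straight line.
x = np.arange(1, 11)
exponential_y = np.exp(x)  # Values that increase exponentially with x.
log_of_y = np.log(exponential_y)  # Taking logs recovers a straight line.
plt.subplot(1, 2, 1)
plt.plot(x, exponential_y)
plt.title('Exponential')
plt.subplot(1, 2, 2)
plt.plot(x, log_of_y)
plt.title('After log');
```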

You have already seen logs in action transforming the odds-ratio predictions.

## The logit transform and its inverse¶

The logit transformation from the sigmoid curve to the straight line consists of two steps:

1. Convert probabilities to odds ratios.
2. Take the log of the result.

The full logit transformation is therefore:

```
def logit(p):
    """ Apply logit transformation to array of probabilities `p`
    """
    odds_ratios = p / (1 - p)
    return np.log(odds_ratios)
```

Here are the original sigmoid predictions and the predictions with the `logit` transform applied:
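The side-by-side plot is not reproduced here; a sketch of how it could be produced from the `predictions_new` values above:

```
# Sketch of the side-by-side plot (the original plotting cell is not shown).
plt.subplot(1, 2, 1)
plt.scatter(hemoglobin, predictions_new, color='gold')
plt.title('Sigmoid p predictions')
plt.subplot(1, 2, 2)
plt.scatter(hemoglobin, logit(predictions_new), color='gold')
plt.title('Logit of predictions');
```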

We also want to be able to go backwards, from the straight-line predictions, to the sigmoid predictions.

`np.exp` reverses (inverts) the `np.log` transformation (see the logarithm
refresher page):

```
# np.exp reverses the effect of np.log.
some_values = np.array([1, 0.5, 3, 6, 0.1])
values_back = np.exp(np.log(some_values))
values_back
```

```
array([1. , 0.5, 3. , 6. , 0.1])
```

You have seen above that there is a simple formula to go from odds ratios to
probabilities. The transformation that *reverses* (inverts) the logit
transform is therefore:

```
def inv_logit(v):
    """ Reverse logit transformation on array `v`
    """
    odds_ratios = np.exp(v)  # Reverse the log operation.
    return odds_ratios / (odds_ratios + 1)  # Reverse odds ratios operation.
```

`inv_logit` takes points on a straight line, and converts them to points on a
sigmoid.

First we convince ourselves that `inv_logit` does indeed reverse the `logit`
transform:

```
some_p_values = np.array([0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99])
some_log_odds = logit(some_p_values)
back_again = inv_logit(some_log_odds)
print('Logit, then inv_logit returns the original data', back_again)
```

```
Logit, then inv_logit returns the original data [0.01 0.05 0.1 0.5 0.9 0.95 0.99]
```

The plot above has the sigmoid curve p-value predictions, and the p-value predictions with the logit transformation applied.

Next you see the logit-transformed results on the left. The right shows the
results of applying `inv_logit` to recover the original p values.

## Effect of the Logit slope and intercept on the sigmoid¶

Changing the intercept of the logit (log-odds) straight line moves the corresponding inverse logit sigmoid curve left and right on the horizontal axis:

```
for intercept in [-6, -7, -8]:
    plt.plot(fine_x,
             params2pps(intercept, 0.8, fine_x),
             linewidth=1,
             linestyle=':',
             label='logit intercept=%d' % intercept)
plt.title('Sigmoid probability predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Probability prediction')
plt.legend();
```

Changing the slope of the logit straight line makes the transition from 0 to 1 flatter or steeper:

```
for slope in [0.6, 0.8, 1.0]:
    plt.plot(fine_x,
             params2pps(-7, slope, fine_x),
             linewidth=1,
             linestyle=':',
             label='logit slope=%.1f' % slope)
plt.title('Sigmoid probability predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Probability prediction')
plt.legend();
```

## A first-pass at logistic regression¶

You may now see how we could use `minimize` with a slope and intercept.

Remember that `minimize` needs a *cost function* that takes some parameters -
in our case, the intercept and slope - and returns a score, or cost, for those
parameters.

In the `ss_any_line` cost function above, the score it returns is just the sum
of squared prediction errors from the line. But the cost function can do
anything it likes to generate a score from the line.

In our case, we’re going to make a cost function that takes the intercept and
slope, and generates predictions using the intercept and slope and `hemoglobin`
values. But these predictions are for the log-odds transformed values, on the
straight line. So the cost function then converts the predictions into p value
predictions, on the corresponding sigmoid, and compares these predictions to
the 0 / 1 values in `appetite_d`.

For example, let’s say we want to get a score for the intercept -7 and the slope 0.8.

First we get the straight-line predictions in the usual way:

```
intercept, slope = -7, 0.8
sl_predictions = intercept + slope * hemoglobin
plt.scatter(hemoglobin, sl_predictions)
plt.xlabel('Hemoglobin')
plt.ylabel('Straight-line (log-odds) predictions');
```

These are predictions on the straight line, but we now need to transform them
to p value (0 to 1) predictions on the sigmoid. We use `inv_logit`:

```
sigmoid_predictions = inv_logit(sl_predictions)
plt.scatter(hemoglobin, sigmoid_predictions)
plt.xlabel('Hemoglobin')
plt.ylabel('Sigmoid (p-value) predictions');
```

Finally, we want to compare the predictions to the actual data to get a score. One way we could do this is our good old sum of squares difference between the sigmoid p-value predictions and the 0 / 1 values:

```
sigmoid_error = appetite_d - sigmoid_predictions
sigmoid_score = np.sum(sigmoid_error ** 2)
sigmoid_score
```

```
10.044653157567184
```

Next we make the cost function that minimize will use. It must accept an array of parameters (an intercept and slope), and calculate the sum of squares error, using the predicted p-values:

```
def ss_logit(c_s, x_values, y_values):
    # Unpack intercept and slope into values.
    intercept, slope = c_s
    # Predicted values for log-odds straight line.
    predicted_log_odds = intercept + slope * x_values
    # Predicted p values on sigmoid.
    pps = inv_logit(predicted_log_odds)
    # Prediction errors.
    sigmoid_error = y_values - pps
    # Sum of squared error.
    return np.sum(sigmoid_error ** 2)
```

We check our function gives the same results as the step-by-step calculation above.

```
ss_logit([-7, 0.8], hemoglobin, appetite_d)
```

```
10.044653157567184
```

Notice what is happening here. The cost function gets the new intercept and slope to try, makes the predictions from the intercept and slope, converts the predictions to probabilities on the sigmoid, and tests those against the real labels.

Now let’s see the cost function in action:

```
min_res_logit = minimize(ss_logit, [-7, 0.8], args=(hemoglobin, appetite_d))
min_res_logit
```

```
fun: 9.621504133676334
hess_inv: array([[ 6.28125352, -0.59480177],
[-0.59480177, 0.05889232]])
jac: array([1.19209290e-07, 2.38418579e-07])
message: 'Optimization terminated successfully.'
nfev: 42
nit: 12
njev: 14
status: 0
success: True
x: array([-5.28090393, 0.58493976])
```

Does this result look like it gives more convincing sigmoid predictions than our guessed intercept and slope of -7 and 0.8?

First get the sigmoid predictions from this line:

```
logit_ss_inter, logit_ss_slope = min_res_logit.x
# Predicted values on log-odds straight line.
predicted_log_odds = logit_ss_inter + logit_ss_slope * hemoglobin
# Predicted p values on sigmoid.
logit_ss_pps = inv_logit(predicted_log_odds)
```

Then plot:
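The plot itself is not reproduced here; a minimal sketch, using the values just computed, might be:

```
# Sketch of the plot of the data and the least-squares sigmoid predictions
# (the original plotting cell is not reproduced here).
plt.scatter(hemoglobin, appetite_d, label='Actual values')
plt.scatter(hemoglobin, logit_ss_pps, color='gold',
            label='Sigmoid predictions (sum of squares)')
plt.xlabel('Hemoglobin')
plt.ylabel('Appetite dummy / prediction')
plt.legend();
```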

## A different measure of prediction error¶

Our sigmoid prediction from sum of squares above looks convincing enough, but is there a better way of scoring the predictions from our line than sum of squares?

It turns out there is another, quite different and useful way to score the
predictions, called *likelihood*. For reasons we discuss in this
page, all standard implementations of logistic regression
that we know of use the *likelihood* measure that we describe below, instead
of the sum of squares measure you see above.

Likelihood asks the question: assuming our predicting line, how likely is the sequence of actual 0 / 1 values that we see?

To answer this question, we first ask this question about the individual 0 / 1 values.

We start with our intercept of -7 and slope of 0.8 for the straight-line log-odds values. We generate the straight-line predictions, then convert them to sigmoid p-value predictions.

```
log_odds_predictions = -7 + hemoglobin * 0.8
sigmoid_p_predictions = inv_logit(log_odds_predictions)
```

Remember, these are the predicted probabilities of a 1 label. We rename to remind ourselves:

```
pp_of_1 = sigmoid_p_predictions
```

We put these predictions into a copy of our data set to make them easier to display:

```
hgb_predicted = hgb_app.copy()
hgb_predicted['pp_of_1'] = pp_of_1
hgb_predicted.head()
```

|   | Hemoglobin | Appetite | appetite_dummy | pp_of_1 |
|---|---|---|---|---|
| 0 | 11.2 | poor | 0 | 0.876533 |
| 1 | 9.5 | poor | 0 | 0.645656 |
| 2 | 10.8 | poor | 0 | 0.837535 |
| 3 | 5.6 | poor | 0 | 0.074468 |
| 4 | 7.7 | poor | 0 | 0.301535 |

Consider the last row in this data frame:

```
last_row = hgb_predicted.tail(1)
last_row
```

|   | Hemoglobin | Appetite | appetite_dummy | pp_of_1 |
|---|---|---|---|---|
| 157 | 15.8 | good | 1 | 0.99646 |

The `pp_of_1` value is the probability that we will see a label of 1 for this
observation. The outcome was, in fact, 1. Therefore the predicted
probability that we will see the actual value is just the corresponding
`pp_of_1` value:

```
pp_of_last_value = pp_of_1.iloc[-1]
pp_of_last_value
```

```
0.9964597097790535
```

Now consider the first row:

```
first_row = hgb_predicted.head(1)
first_row
```

|   | Hemoglobin | Appetite | appetite_dummy | pp_of_1 |
|---|---|---|---|---|
| 0 | 11.2 | poor | 0 | 0.876533 |

The probability for an `appetite_dummy` value of 1 is:

```
pp_of_1[0]
```

```
0.876532952434776
```

Because the values can only be 0 or 1, the predicted probability of an
`appetite_dummy` of 0 must be 1 minus the (predicted probability of 1):

```
pp_of_first_value = 1 - pp_of_1[0]
pp_of_first_value
```

```
0.12346704756522398
```

We can therefore calculate the predicted probability of each 0 / 1 label in the data frame like this:

```
# Predicted probability of 0.
pp_of_0 = 1 - pp_of_1
# The next line sets all the label == 1 values correctly, but
# the label == 0 values are now incorrect.
pp_of_label = pp_of_1.copy()
# Set the label == 0 values correctly.
poor_appetite = appetite_d == 0
pp_of_label[poor_appetite] = pp_of_0[poor_appetite]
# Put this into the data frame for display
hgb_predicted['pp_of_label'] = pp_of_label
hgb_predicted
```

|   | Hemoglobin | Appetite | appetite_dummy | pp_of_1 | pp_of_label |
|---|---|---|---|---|---|
| 0 | 11.2 | poor | 0 | 0.876533 | 0.123467 |
| 1 | 9.5 | poor | 0 | 0.645656 | 0.354344 |
| 2 | 10.8 | poor | 0 | 0.837535 | 0.162465 |
| 3 | 5.6 | poor | 0 | 0.074468 | 0.925532 |
| 4 | 7.7 | poor | 0 | 0.301535 | 0.698465 |
| ... | ... | ... | ... | ... | ... |
| 153 | 15.7 | good | 1 | 0.996166 | 0.996166 |
| 154 | 16.5 | good | 1 | 0.997975 | 0.997975 |
| 155 | 15.8 | good | 1 | 0.996460 | 0.996460 |
| 156 | 14.2 | good | 1 | 0.987383 | 0.987383 |
| 157 | 15.8 | good | 1 | 0.996460 | 0.996460 |

158 rows × 5 columns

There’s a fancy short-cut to the several lines above, that relies on the fact
that multiplying by 0 gives 0. We can form the `pp_of_label` column by first
multiplying the 0 / 1 label column values by the values in `pp_of_1`. This
correctly sets the `pp_of_label` values for labels of 1, and leaves the
remainder as 0. Then we reverse the 0 / 1 labels by subtracting from 1, and
multiply by `pp_of_0` to set the 0 values to their correct values. Adding
these two results gives us the correct set of values for both 1 and 0 labels.

```
# Set label == 1 values to predicted p, others to 0
pos_ps = appetite_d * pp_of_1
# Set label == 0 values to 1-predicted p, others to 0
neg_ps = (1 - appetite_d) * pp_of_0
# Combine by adding.
final = pos_ps + neg_ps
# This gives the same result as the code cell above:
hgb_predicted['pp_of_label_fancy'] = final
hgb_predicted
```

|   | Hemoglobin | Appetite | appetite_dummy | pp_of_1 | pp_of_label | pp_of_label_fancy |
|---|---|---|---|---|---|---|
| 0 | 11.2 | poor | 0 | 0.876533 | 0.123467 | 0.123467 |
| 1 | 9.5 | poor | 0 | 0.645656 | 0.354344 | 0.354344 |
| 2 | 10.8 | poor | 0 | 0.837535 | 0.162465 | 0.162465 |
| 3 | 5.6 | poor | 0 | 0.074468 | 0.925532 | 0.925532 |
| 4 | 7.7 | poor | 0 | 0.301535 | 0.698465 | 0.698465 |
| ... | ... | ... | ... | ... | ... | ... |
| 153 | 15.7 | good | 1 | 0.996166 | 0.996166 | 0.996166 |
| 154 | 16.5 | good | 1 | 0.997975 | 0.997975 | 0.997975 |
| 155 | 15.8 | good | 1 | 0.996460 | 0.996460 | 0.996460 |
| 156 | 14.2 | good | 1 | 0.987383 | 0.987383 | 0.987383 |
| 157 | 15.8 | good | 1 | 0.996460 | 0.996460 | 0.996460 |

158 rows × 6 columns

We can do the whole cell above in one line:

```
# Compact version of pp of label calculation.
final_again = appetite_d * pp_of_1 + (1 - appetite_d) * (1 - pp_of_1)
hgb_predicted['pp_again'] = final_again
hgb_predicted
```

|   | Hemoglobin | Appetite | appetite_dummy | pp_of_1 | pp_of_label | pp_of_label_fancy | pp_again |
|---|---|---|---|---|---|---|---|
| 0 | 11.2 | poor | 0 | 0.876533 | 0.123467 | 0.123467 | 0.123467 |
| 1 | 9.5 | poor | 0 | 0.645656 | 0.354344 | 0.354344 | 0.354344 |
| 2 | 10.8 | poor | 0 | 0.837535 | 0.162465 | 0.162465 | 0.162465 |
| 3 | 5.6 | poor | 0 | 0.074468 | 0.925532 | 0.925532 | 0.925532 |
| 4 | 7.7 | poor | 0 | 0.301535 | 0.698465 | 0.698465 | 0.698465 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 15.7 | good | 1 | 0.996166 | 0.996166 | 0.996166 | 0.996166 |
| 154 | 16.5 | good | 1 | 0.997975 | 0.997975 | 0.997975 | 0.997975 |
| 155 | 15.8 | good | 1 | 0.996460 | 0.996460 | 0.996460 | 0.996460 |
| 156 | 14.2 | good | 1 | 0.987383 | 0.987383 | 0.987383 | 0.987383 |
| 157 | 15.8 | good | 1 | 0.996460 | 0.996460 | 0.996460 | 0.996460 |

158 rows × 7 columns

When the `pp_of_label` values are near 1, this means the result was close to
the prediction. When they are near 0, it means the result was unlike the
prediction.

Now we know the probabilities of each actual 0 and 1 label, we can get a
measure of how *likely* the *combination* of all these labels is, given these
predictions. We do this by *multiplying* all the probabilities in
`pp_of_label`. When the probabilities in `pp_of_label` are all fairly near 1,
multiplying them will give a number that is not very small. When some or many
of the probabilities are close to 0, the multiplication will generate a very
small number.

The result of this multiplication is called the *likelihood* of this set of
labels, given the predictions.

If the predictions are closer to the actual values, the likelihood will be larger, and closer to 1. When the predictions are not close, the likelihood will be low, and closer to 0.

Here is the likelihood for our intercept of -7 and slope of 0.8.

```
# np.prod multiplies each number to give the product of all the numbers.
likelihood = np.prod(pp_of_label)
likelihood
```

```
1.463854029404182e-13
```

This is a very small number. Likelihoods are often close to 0 with a
reasonable number of points, and somewhat inexact prediction, because there
tend to be a reasonable number of small p values in `pp_of_label`. The
question is - do other values for the intercept and slope give smaller or
larger values for the likelihood?

We can put this *likelihood* scoring into our cost function, instead of
scoring with the sum of squares.

One wrinkle is that we want our cost function value to be *lower* when the
line is a good predictor, but the likelihood is *higher* when the line is a
good predictor. We solve this simply by sticking a minus on the likelihood
before we return it from the cost function. This makes `minimize` find the
parameters giving the *minimum* of the *negative likelihood*, and therefore,
the *maximum likelihood* (ML).

```
def simple_ml_logit_cost(intercept_and_slope, x, y):
    """ Simple version of cost function for maximum likelihood
    """
    intercept, slope = intercept_and_slope
    # Make predictions for sigmoid.
    predicted_log_odds = intercept + slope * x
    pp_of_1 = inv_logit(predicted_log_odds)
    # Calculate predicted probabilities of actual labels.
    pp_of_labels = y * pp_of_1 + (1 - y) * (1 - pp_of_1)
    likelihood = np.prod(pp_of_labels)
    # Ask minimize to find maximum by adding minus sign.
    return -likelihood
```

We can try that cost function, starting at our familiar intercept of -7 and slope of 0.8.

Before we do, there is one extra wrinkle, that we will solve in another way
further down the notebook. At the moment the likelihood values we are
returning are very small - in the order of \(10^{-13}\). We expect scores for
different plausible lines to differ from each other by an even smaller number.
By default, `minimize` will treat these very small differences as
insignificant. We can tell `minimize` to pay attention to these tiny
differences by passing a very small value to the `tol` parameter (`tol` for
*tolerance*).

```
mr_ML = minimize(simple_ml_logit_cost,  # Cost function
                 [-7, 0.8],  # Guessed intercept and slope
                 args=(hemoglobin, appetite_d),  # x and y values
                 tol=1e-16)  # Attend to tiny changes in cost function values.
# Show the result.
mr_ML
```

```
fun: -2.1171640882196986e-13
hess_inv: array([[4.57223749e+07, 4.62358140e+08],
[4.62358140e+08, 4.67550198e+09]])
jac: array([ 2.27949458e-14, -2.25416304e-15])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 145
nit: 6
njev: 45
status: 2
success: False
x: array([-7.00275132, 0.77217755])
```

`minimize` complains that it was not happy with the numerical accuracy of its
answer, and we will address that soon. For now, let us look at the predictions
to see if they are reasonable.

```
inter_ML, slope_ML = mr_ML.x
predicted_ML = inv_logit(inter_ML + slope_ML * hemoglobin)
```
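The plot is not reproduced here; a minimal sketch of checking these predictions against the data:

```
# Sketch of the plot of the maximum-likelihood predictions (the original
# plotting cell is not reproduced here).
plt.scatter(hemoglobin, appetite_d, label='Actual values')
plt.scatter(hemoglobin, predicted_ML, color='gold',
            label='Maximum likelihood predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Appetite dummy / prediction')
plt.legend();
```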

## A computation trick for the likelihood¶

As you have seen, the likelihood values can be, and often are, very small, and close to zero.

You may also know that standard numerical calculations on computers are not completely precise; the calculations are only accurate to around 16 decimal places. This is because of the way computers store floating point numbers. You can find more detail in this page on floating point numbers.

The combination of small likelihood values, and limited calculation precision, can be a problem for the simple ML logistic cost function you see above. As a result, practical implementations of logistic regression use an extra trick to improve the calculation accuracy of the likelihoods.

This trick also involves logarithms.

Here we show a very important property of the logarithm transform:

```
print('Multiplying two numbers', 11 * 15)
print('Take logs, add logs, unlog', np.exp(np.log(11) + np.log(15)))
```

```
Multiplying two numbers 165
Take logs, add logs, unlog 165.00000000000009
```

What you see here is that we get (almost) the same answer if we:

* multiply the two numbers, OR
* take the logs of the two numbers, *add them*, then reverse the log operation, in this case with `np.exp`.

We get *almost* the same number because of the limitations of the precision of
the calculations. For our purposes, we do not need to worry about these tiny
differences.

The log-add-unlog trick means that we can replace multiplication by addition, if we take the logs of the values.

Here we do the same trick on an array of numbers:

```
some_numbers = np.array([11, 15, 1, 0.3])
print('Product of the array', np.prod(some_numbers))
print('Log-add-unlog on the array', np.exp(np.sum(np.log(some_numbers))))
```

```
Product of the array 49.5
Log-add-unlog on the array 49.500000000000036
```

We can use this same trick to calculate our likelihood by adding logs instead of multiplying the probabilities directly:

```
print('Likelihood with product of array', np.prod(pp_of_label))
# The log-add-unlog version.
logs_of_pp = np.log(pp_of_label)
log_likelihood = np.sum(logs_of_pp)
print('Likelihood with log-add-unlog', np.exp(log_likelihood))
```

```
Likelihood with product of array 1.463854029404182e-13
Likelihood with log-add-unlog 1.4638540294041883e-13
```

This doesn’t seem to have solved our whole problem, because we still end up
with the type of tiny number that confuses `minimize`.

The next stage of the solution is to realize that `minimize` does not need
the *actual likelihood*; it needs some number that goes *up or down in exactly
the same way as the likelihood*. If the cost function can return some number
that goes up when likelihood goes up, and goes down when likelihood goes down,
`minimize` will still find the same parameters to minimize this cost function.

Specifically we want the value that comes back from the cost function to vary
monotonically with respect to the likelihood. Consider any two pairs of
(intercept, slope). Call the first pair `p1` and the second pair `p2`. Say
`p1` gives a likelihood value of `L1` and `p2` gives a likelihood value of
`L2`. Now say we’ve got another way of calculating a value for `p1` and `p2`,
and the corresponding values that come back from the new calculation are `N1`
and `N2`. Remember `minimize` is searching for the (intercept, slope) pair
that gives the lowest value from the cost function. So, we can swap the new
calculation into our cost function, and get the same parameters back from
`minimize`, as long as it is always true that:

* when `L1 > L2` then `N1 > N2`;
* when `L1 < L2` then `N1 < N2`; and
* when `L1 == L2`, then `N1 == N2`.

In that case we call our new calculation generating `N1` and `N2` *monotonic*
with respect to the original calculation generating `L1` and `L2`. In our case
this original calculation is the likelihood. The new monotonic calculation is
the *log likelihood*.

In the plot below, you can see that when the likelihood goes up, the log of
the likelihood also goes up, and vice versa, so the log of the likelihood *is*
monotonic with respect to the likelihood, and we can use it instead of the
likelihood in our cost function.

```
likelihood_values = np.linspace(0, 1, 1000)
plt.plot(likelihood_values, np.log(likelihood_values))
plt.xlabel('Likelihood')
plt.ylabel('Log likelihood');
```

```
<ipython-input-62-e831d14c95fa>:2: RuntimeWarning: divide by zero encountered in log
plt.plot(likelihood_values, np.log(likelihood_values))
```

This means that our cost function does not have to do the last nasty unlog step above, that generates the tiny value for likelihood. We can just return the (minus of the) log of the likelihood. The log of the likelihood turns out to be a manageable negative number, even when the resulting likelihood is tiny.

```
log_likelihood
```

```
-29.552533505045645
```

This trick gives us a much more tractable number to return from the cost function, because the log likelihood is much less affected by errors from lack of precision in the calculation.

```
def mll_logit_cost(intercept_and_slope, x, y):
    """ Cost function for maximum log likelihood

    Return minus of the log of the likelihood.
    """
    intercept, slope = intercept_and_slope
    # Make predictions for sigmoid.
    predicted_log_odds = intercept + slope * x
    pp_of_1 = inv_logit(predicted_log_odds)
    # Calculate predicted probabilities of actual labels.
    pp_of_labels = y * pp_of_1 + (1 - y) * (1 - pp_of_1)
    # Use logs to calculate log of the likelihood.
    log_likelihood = np.sum(np.log(pp_of_labels))
    # Ask minimize to find maximum by adding minus sign.
    return -log_likelihood
```

Use the new cost function:

```
mr_MLL = minimize(mll_logit_cost,  # Cost function
                  [-7, 0.8],  # Guessed intercept and slope
                  args=(hemoglobin, appetite_d))  # x and y values
# Show the result.
mr_MLL
```

```
fun: 29.167993895857247
hess_inv: array([[ 2.74626368, -0.25696584],
[-0.25696584, 0.02508566]])
jac: array([4.76837158e-07, 4.76837158e-07])
message: 'Optimization terminated successfully.'
nfev: 27
nit: 7
njev: 9
status: 0
success: True
x: array([-7.29187264, 0.79915442])
```

Notice that we did not have to tell `minimize` to use a very small value for
the tolerance this time.

The values that come back are very similar to our previous, more fragile version that used the likelihood:

```
inter_MLL, slope_MLL = mr_MLL.x
predicted_MLL = inv_logit(inter_MLL + slope_MLL * hemoglobin)
```
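To see how similar the two fits are, we can compare the parameters and predictions directly; all the values used here were computed above.

```
# Compare likelihood (ML) and log-likelihood (MLL) results.
print('ML intercept and slope:', mr_ML.x)
print('MLL intercept and slope:', mr_MLL.x)
print('Largest difference in predictions:',
      np.max(np.abs(predicted_ML - predicted_MLL)))
```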

You have just seen the standard calculations behind most packages that implement logistic regression.

To show this is standard, let us do the same regression in Statsmodels.

## Logistic Regression with Statsmodels¶

As with linear regression, we can easily perform logistic regression using Statsmodels.

```
# Get the formula interface for Statsmodels
import statsmodels.formula.api as smf
```

```
# Create the model.
log_reg_mod = smf.logit('appetite_dummy ~ Hemoglobin', data=hgb_app)
# Fit it.
fitted_log_reg_mod = log_reg_mod.fit()
fitted_log_reg_mod.summary()
```

```
Optimization terminated successfully.
Current function value: 0.184608
Iterations 8
```

| Dep. Variable: | appetite_dummy | No. Observations: | 158 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 156 |
| Method: | MLE | Df Model: | 1 |
| Date: | Tue, 04 Jan 2022 | Pseudo R-squ.: | 0.4976 |
| Time: | 19:02:40 | Log-Likelihood: | -29.168 |
| converged: | True | LL-Null: | -58.054 |
| Covariance Type: | nonrobust | LLR p-value: | 2.944e-14 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -7.2919 | 1.659 | -4.396 | 0.000 | -10.543 | -4.041 |
| Hemoglobin | 0.7992 | 0.158 | 5.042 | 0.000 | 0.489 | 1.110 |

Notice that Statsmodels lists the “Model” as “Logit” and the “Method” as “MLE” — Maximum Likelihood Estimation.

Look at the table above under “coef”. Compare the logistic regression intercept
and slope that Statsmodels found to the ones we got from `minimize` and the
maximum log likelihood (MLL) cost function.

```
print('Intercept from minimize =', inter_MLL)
print('Slope from minimize =', slope_MLL)
```

```
Intercept from minimize = -7.2918726387296235
Slope from minimize = 0.7991544178732585
```

Finally, we can use the `predict` method of Statsmodels to generate predicted
probabilities from the logistic regression model we have just fitted:

```
sm_predictions = fitted_log_reg_mod.predict(hgb_app['Hemoglobin'])
sm_predictions.head()
```

```
0 0.840058
1 0.574466
2 0.792325
3 0.056433
4 0.242617
dtype: float64
```

Let us plot the predicted probabilities of having “good” appetite, from Statsmodels:
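The plot is not reproduced here; a minimal sketch using the `sm_predictions` values from the cell above:

```
# Sketch of the plot of the Statsmodels predicted probabilities (the
# original plotting cell is not reproduced here).
plt.scatter(hemoglobin, appetite_d, label='Actual values')
plt.scatter(hemoglobin, sm_predictions, color='gold',
            label='Statsmodels predictions')
plt.xlabel('Hemoglobin')
plt.ylabel('Appetite dummy / prediction')
plt.legend();
```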

We can see graphically that these predictions look identical to the ones we obtained from minimize.

Let us see what the largest absolute difference between the predictions from the two methods is:

```
np.max(np.abs(predicted_MLL - sm_predictions))
```

```
1.0738044586844353e-07
```

That is very close to 0. The models are making almost identical predictions.

## Summary¶

This tutorial has shown you how to do binary logistic regression with one numerical predictor variable.

## Exercises for reflection¶

You might want to read the logarithm refresher page for these questions.

1. Consider the expression `np.exp(y)` in the `inv_logit` function above. This means “raise `np.e` to the power of the values in `y`”, and could equally well be written as `np.e ** y` (it would give exactly the same result). Now consider changing `np.exp(y)` to `10 ** y`. Now we are working in base 10 instead of base `np.e`. What would happen to:

   * The best-fit p value predictions returned via `minimize`?
   * The slope and intercept of the best-fit log-odds straight line?

   Reflect, using the reasoning here, then try it and see.

2. Consider the expression `np.sum(np.log(pp_of_labels))` in the `mll_logit_cost` cost function. This is using `np.log`, which is log to the base `np.e`. Consider what would happen if you changed this expression to `np.sum(np.log10(pp_of_labels))` (log to the base 10). What would happen to:

   * The best-fit p value predictions returned via `minimize`?
   * The slope and intercept of the best-fit log-odds straight line?

   Reflect, using the reasoning here, then try it and see.