In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import pandas as pd

## Describing distributions

We have seen several examples of *distributions*.

We can describe distributions as having a *center* and a *spread*.

In [the mean as predictor](mean_meaning), we saw that the mean is
a useful measure of the center of a distribution.

What measure should we use for the spread?

## Back to chronic kidney disease

We return to the [data on chronic kidney disease](https://matthew-brett.github.io/dsfe2019/data/chronic_kidney_disease).

Download the data to your computer via this link: [ckd_clean.csv](https://matthew-brett.github.io/dsfe2019/data/ckd_clean.csv).

In [None]:
ckd_full = pd.read_csv('ckd_clean.csv')
ckd_full.head()

We will use this dataset to get a couple of variables (columns) and
therefore a couple of distributions.

Let's start with the White Blood Cell Count, usually abbreviated as WBC.

In [None]:
wbc = ckd_full['White Blood Cell Count']
wbc.hist()
plt.title('White Blood Cell Count');

In [None]:
wbc.describe()

Compare this to Hemoglobin concentrations:

In [None]:
hgb = ckd_full['Hemoglobin']
hgb.hist()
plt.title('Hemoglobin');

In [None]:
hgb.describe()

We can't easily plot these two on the same axes, because
their units are so different.

Here's what happens if we try.  Notice that the hemoglobin values collapse into a tiny spike at the left.

In [None]:
# Use alpha to make the histograms a little transparent.
# Label them for a legend.
hgb.hist(alpha=0.7, label='HGB')
wbc.hist(alpha=0.7, label='WBC')
plt.title("HGB and WBC together - HGB tiny spike at left")
plt.legend();

We could try to fix this by subtracting the mean, as a measure of the
center, so that the values become *deviations* from the mean.

In [None]:
wbc_deviations = wbc - np.mean(wbc)
wbc_deviations.hist()
plt.title('White Blood Cell Count deviations');

In [None]:
hgb_deviations = hgb - np.mean(hgb)
hgb_deviations.hist()
plt.title('Hemoglobin deviations');

The deviations each have a mean very close to zero (not exactly zero,
only because of tiny rounding errors in the calculation), and therefore
they have the same center:

In [None]:
np.mean(wbc_deviations), np.mean(hgb_deviations)

We still cannot sensibly plot them on the same axes, because the WBC
deviations have a very different *spread*.  The WBC values completely dominate
the x axis of the graph.  We can't reasonably compare the WBC deviations to the
Hemoglobin deviations, because they are still in such different *units*.

In [None]:
hgb_deviations.hist(alpha=0.7, label='HGB')
wbc_deviations.hist(alpha=0.7, label='WBC')
plt.title("HGB and WBC deviations - you can't see HGB")
plt.legend();

We would like a measure of the spread of the distribution, so we can set
the two distributions to have the same spread.

## The standard deviation

In the [mean as predictor](mean_meaning) section, we found that the mean
was the best value to use as a predictor, if we want to minimize the sum
of *squared* deviations.

Maybe we could get an idea of the typical *squared* deviation, as
a measure of spread?

In [None]:
hgb_deviations[:10]

In [None]:
hgb_dev_sq = hgb_deviations ** 2
hgb_dev_sq[:10]

In [None]:
hgb_dev_sq.hist()
plt.title('HGB squared deviations');

We could use the *mean* as the center, or typical value, of this distribution.

In [None]:
hgb_dev_sq_mean = np.mean(hgb_dev_sq)
hgb_dev_sq_mean

This is the *mean squared deviation*, also called the *variance*.  NumPy
has a function to calculate it in one shot:

In [None]:
# The mean squared deviation is the variance
np.var(hgb)
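
As a check on the arithmetic, here is a sketch of the same calculation on a small set of made-up numbers. The `values` array is invented for illustration, and is not from the kidney disease data. The mean of these values is 5, and the mean squared deviation works out to 4.

```python
import numpy as np

# Made-up values for illustration only; not from the CKD dataset.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Deviations from the mean (the mean here is 5).
deviations = values - np.mean(values)
# Square the deviations, then take their mean.
mean_sq_dev = np.mean(deviations ** 2)

# np.var gives the same answer as the calculation by hand.
print(mean_sq_dev, np.var(values))
```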

The mean squared deviation is a good indicator of the typical squared
deviation, but it is in squared units.  What should we use as a measure
of the typical deviation, in the original units?

We could take the square root of the mean squared deviation, like this:

In [None]:
np.sqrt(hgb_dev_sq_mean)

This is a measure of the spread of the distribution.  It is a measure of
the typical or average deviation.

It is also called the *standard deviation*.

In [None]:
np.std(hgb)
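
One thing to watch for, sketched below with made-up numbers: `np.std` divides the sum of squared deviations by the number of values `n`, to give the mean. The Pandas `std` method, and the `std` row in the output of `describe`, divide by `n - 1` by default, so they give a slightly larger answer. The `values` array here is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not from the CKD dataset.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
series = pd.Series(values)

# Square root of the mean squared deviation.
by_hand = np.sqrt(np.mean((values - np.mean(values)) ** 2))

# np.std divides by n, and matches the calculation by hand.
numpy_std = np.std(series)
# The Pandas std method divides by n - 1 by default ...
pandas_std = series.std()
# ... unless we ask it to divide by n, with ddof=0.
pandas_std_n = series.std(ddof=0)

print(by_hand, numpy_std, pandas_std, pandas_std_n)
```

If you want the Pandas method to match the mean squared deviation calculation in this page, pass `ddof=0`.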

We can give our distribution a standard center *and* a standard spread
by dividing our mean-centered distribution by the standard deviation.
Then the distribution will have a standard deviation very close to 1.

This version of the distribution, with mean 0 and standard deviation of
1, is called the *standardized* distribution.

In [None]:
standardized_hgb = hgb_deviations / np.std(hgb)
standardized_hgb.hist()
plt.title('Standardized Hemoglobin');

We can make a function to do this:

In [None]:
def standard_units(x):
    return (x - np.mean(x))/np.std(x)
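
As a quick check, here is a sketch that applies the same function to a small array of made-up numbers. The result should have a mean very close to 0 and a standard deviation very close to 1. The `values` array is invented for illustration.

```python
import numpy as np

def standard_units(x):
    # Subtract the mean, then divide by the standard deviation.
    return (x - np.mean(x)) / np.std(x)

# Made-up values for illustration only; not from the CKD dataset.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
standardized = standard_units(values)

# Mean very close to 0, standard deviation very close to 1.
print(np.mean(standardized), np.std(standardized))
```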

In [None]:
std_hgb_again = standard_units(hgb)
std_hgb_again.hist()
plt.title('Standardized Hemoglobin, again');

If we do the same to the WBC, we can compare values of the
distributions:

In [None]:
std_wbc = standard_units(wbc)
std_wbc.hist()
plt.title('Standardized White Blood Cell Count');

Now we can put both these distributions on the same graph, to compare them directly.

In [None]:
std_hgb_again.hist(alpha=0.7, label='HGB')
std_wbc.hist(alpha=0.7, label='WBC')
plt.title('Standardized HGB and WBC')
plt.legend();

Every value in standardized units gives the deviation of the original
value from its mean, in terms of the number of standard deviations.
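
For example, the made-up numbers below (invented for illustration) have a mean of 5 and a standard deviation of 2. The value 9 is therefore 2 standard deviations above the mean, and its value in standard units is 2. The sketch also shows that we can go back from standard units to the original values, by multiplying by the standard deviation and adding the mean.

```python
import numpy as np

def standard_units(x):
    # Subtract the mean, then divide by the standard deviation.
    return (x - np.mean(x)) / np.std(x)

# Made-up values for illustration only; not from the CKD dataset.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
standardized = standard_units(values)

# The last value, 9, is 2 standard deviations above the mean of 5.
print(standardized[-1])

# Reverse the standardization to recover the original values.
recovered = standardized * np.std(values) + np.mean(values)
print(recovered)
```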