#############################################
Comparing two groups with permutation testing
#############################################

.. code-links:: clear

We return to the Brexit age data from :doc:`brexit_problem`.

Here's the data from the processing we did in :doc:`brexit_analysis`:

* :download:`remain_leave.csv`.

Download the data file from the link above if you haven't already got it in
your Notebook directory.

.. nbplot::

    >>> import pandas as pd
    >>> remain_leave = pd.read_csv('remain_leave.csv')
    >>> remainers = remain_leave[remain_leave['brexit'] == 1]
    >>> brexiteers = remain_leave[remain_leave['brexit'] == 2]
    >>> # Confirm our proportions haven't changed
    >>> len(brexiteers) / (len(brexiteers) + len(remainers))
    0.4114068441064639

For convenience we get our ages scores into lists for each group:

.. nbplot::

    >>> # We make a list from the Pandas column with the "list" function
    >>> brexit_ages = list(brexiteers['age'])
    >>> remain_ages = list(remainers['age'])
    >>> # Check our proportions again
    >>> len(brexit_ages) / (len(brexit_ages) + len(remain_ages))
    0.4114068441064639

Let's put up the histograms of these two groups again:

.. mpl-interactive::

.. nbplot::

    >>> import matplotlib.pyplot as plt

.. nbplot::

    >>> plt.hist(brexit_ages)
    (...)

.. nbplot::

    >>> plt.hist(remain_ages)
    (...)

The remainers look as though they may be a bit younger on average.

Let's look at the mean age for the two groups.

The *mean* of the values in the list is defined as the sum divided by
the number of items.  Here is the mean age of the Brexit Leave voters:

.. nbplot::

    >>> sum(brexit_ages) / len(brexit_ages)
    51.715341959334566

We can define a function to calculate the mean:

.. nbplot::

    >>> def mean(some_list):
    ...     # Calculates the mean of the values in `some_list`
    ...     return sum(some_list) / len(some_list)

Now we can get the mean with:

.. nbplot::

    >>> mean(brexit_ages)
    51.715341959334566

The mean is lower for the Remain group:

.. nbplot::

    >>> mean(remain_ages)
    48.0155038759689...

The difference between the means is:

.. nbplot::

    >>> observed_difference = mean(brexit_ages) - mean(remain_ages)
    >>> observed_difference
    3.6998380833655773

We see that the means of the two groups are different - but can we be
confident that this difference did not come about by chance?

What do we mean by chance?  Now we have to define our null hypothesis.

We hypothesize that there is in fact no difference between the ages of the two
groups.  We propose that the difference between the means we see could
reasonably occur if we did the following procedure:

* sample 541 + 774 = 1315 people from this same underlying
  doesn't-matter-what-you-voted population;
* arbitrarily split this sample into a group of 541 and a group of 774, call
  these ``group_A`` and ``group_B``;
* calculate the mean age for ``group_A`` and subtract the mean age of
  ``group_B``;

Call this procedure - one trial.  To test our null hypothesis, we would like
to repeat this trial many times, recording the difference in means for each
trial. By doing this we could build up a distribution of the kind of
differences we would expect by chance - the sampling distribution.

But - we don't have any more data - so we don't have many new groups of 541 +
774 = 1315 to sample.  But it turns out, we can use the data we have to build
the distribution we need.

Let's start with the null hypothesis - that there is no difference in the ages
of the Leave and Remain groups. If that hypothesis is true, then all the
ages in ``brexit_ages`` and in ``remain_ages`` can be considered as
being samples from the same underlying group.

To express this, I'm going to pool all the weights into one big group, like
this:

.. nbplot::

    >>> # The + below appends the second list to the first
    >>> all_ages = brexit_ages + remain_ages
    >>> len(all_ages)
    1315

In this new pooled list, the first 541 ages are from the ``brexit_ages`` list,
and the rest are from the ``remain_ages`` list.

Now we have the new pooled list, we can do something similar to taking the new
``group_A`` and ``group_B`` groups we imagined above.  That is, we can
*shuffle* the combined group to a random order, and split this shuffled
combined group into a group of 541 and a group of 774.  We get the difference
in means of these two groups, and store it.  We keep shuffling, to create more
``group_A`` and ``group_B`` groups, and more differences in means. The
generated distribution of the mean differences is the distribution we expect
on the null hypothesis, that there is no real difference between the two
groups.  We can see where ``observed_difference`` lies in this generated
distribution, and this gives us a measure of whether the
``observed_difference`` is likely on the null hypothesis.

Python's ``random.shuffle`` function can do the shuffle for us (see:
:ref:`list-shuffle`):

.. nbplot::

    >>> import random

.. nbplot::
    :hide-from: all
    :show-to: doctest

    By setting the "seed" we make sure that the random permutations below are
    the same each time this code is run.  Comment this guy out to see what
    happens when you get a different set of random permutations below.

    >>> random.seed(7)

Before shuffling, the first 541 age values are all for Leave voters.  Here are
the first 10:

.. nbplot::

    >>> # The first 10 ages before shuffling (all brexit)
    >>> all_ages[:10]
    [71, 60, 74, 61, 47, 56, 76, 35, 44, 38]

Here's a random shuffle of the combined list of ages:

.. nbplot::

    >>> random.shuffle(all_ages)
    >>> # The first 10 ages of the shuffled list, mixed brexit and remain.
    >>> all_ages[:10]
    [24, 59, 55, 31, 89, 66, 64, 35, 28, 41]

Now for our permutation test.  We've assumed the null hypothesis.  We have
randomly shuffled the combined group.  We'll call the first 541 values
``group_A`` and the last 774 values ``group_B``.

After the shuffling, the ``group_A`` group is a random mix of the
``brexit_ages`` and ``remain_ages`` values, as is the ``group_B`` group.

Here is a function that takes the combined list and returns the difference in
means:

.. nbplot::

    >>> def difference_in_means(combined_list):
    ...     """ Split suffled combind group into two, return mean difference
    ...     """
    ...     group_A = combined_list[:541]
    ...     group_B = combined_list[541:]
    ...     return mean(group_B) - mean(group_A)

Let's get the difference in means for these new groups, generated by the
shuffle:

.. nbplot::

    >>> difference_in_means(all_ages)
    0.35758261808212666

That difference from the shuffled groups looks a lot less than the difference
we originally found:

.. nbplot::

    >>> observed_difference
    3.6998380833655773

One difference is not enough.  We need more shuffled mean differences to see
whether ``observed_difference`` is really unusual compared to the range of
permuted group differences.  Here we run the shuffle procedure 10000 times, to
get a large range of values:

.. nbplot::

    >>> n_repeats = 10000
    >>> shuffled_differences = []  # An empty list to store the differences
    >>> for i in range(n_repeats):
    ...     random.shuffle(all_ages)
    ...     new_difference = difference_in_means(all_ages)
    ...     # Collect the new mean by adding to the end of the list
    ...     shuffled_differences.append(new_difference)

This gives us 10000 differences from groups that are compatible with our null
hypothesis.   We can now ask whether ``observed_difference`` is unusually
large compared to the distribution of these 10000 differences.

.. mpl-interactive::

.. nbplot::

    >>> # The plotting package
    >>> import matplotlib.pyplot as plt

.. nbplot::

    >>> plt.hist(shuffled_differences)
    (...)

The maximumum of these differences is:

.. nbplot::

    >>> max(shuffled_differences)
    3.8717109191037764

Remember our ``observed_difference``?

.. nbplot::

    >>> observed_difference
    3.6998380833655773

So - how many of the ``shuffled_differences`` are greater than or equal to the
``observed_difference``?

.. nbplot::

    >>> n_greater_equal = 0
    >>> for i in range(n_repeats):
    ...     if shuffled_differences[i] >= observed_difference:
    ...         n_greater_equal = n_greater_equal + 1
    >>> n_greater_equal
    1

In 10000 samples, we only found one sample greater than or equal to the
observed difference.

So, our estimate is that there is a 1 in 10000 chance that the observed
difference could have come about by differences in random sampling.