5.1 Population and permutation
A problem of populations
As in the Brexit analysis exercise, we analyze the Brexit survey.
As you will see in the link above, the data are from a survey of the UK
population. Each row in the survey corresponds to one person answering. One
of the questions, named cut15
is how the person voted in the Brexit
referendum. Another, numage
is the age of the person in years.
# Array library.
import numpy as np
# Data frame library.
import pandas as pd
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
# Fancy plots
plt.style.use('fivethirtyeight')
If you are running on your laptop, first download the data file to the same directory as this notebook.
We load the data:
# Load the data frame, and put it in the variable "audit_data".
# The values are separated by tab characters, written as "\t" in Python
# strings.
audit_data = pd.read_csv('audit_of_political_engagement_14_2017.tab', sep='\t')
Now get the ages for the Leavers and the Remainers.
A small number of ages are recorded as 0, meaning we do not have the correct age for that person / row. First we drop rows with ages recorded as 0, then select the remaining rows corresponding to people who voted to remain (cut15
value of 1) and leave (cut15
value of 2):
# Drop rows where age is 0
good_data = audit_data[audit_data['numage'] != 0]
# Get data frames for leavers and remainers
remain_ages = good_data[good_data['cut15'] == 1]['numage']
leave_ages = good_data[good_data['cut15'] == 2]['numage']
Show the age distributions for the two groups:
remain_ages.hist()
len(remain_ages)
774
leave_ages.hist()
len(leave_ages)
541
These certainly look like different distributions.
We might summarize the difference, by looking at the difference in means:
leave_mean = np.mean(leave_ages)
leave_mean
51.715341959334566
remain_mean = np.mean(remain_ages)
remain_mean
48.01550387596899
difference = leave_mean - remain_mean
difference
3.6998380833655773
The distributions do look different.
They have a mean difference of nearly 4 years.
Could this be due to sampling error?
If we took two random samples of 774 and 541 voters, from the same population, we would expect to see some difference, just by chance.
By chance means, because random samples vary.
What is the population, in this case?
It is not exactly the whole UK population, because the survey only sampled people who were eligible to vote.
It might not even be the whole UK population, who are eligible to vote. Perhaps the survey company got a not-representative range of ages, for some reason. We are not interested in that question, only the question of whether the Leave and Remain voters could come from the same population, where the population is, people selected by the survey company.
How do we find this population, to do our simulation?
Population by permutation
Here comes a nice trick. We can use the data that we already have, to simulate the effect of drawing lots of random samples, from the underlying population.
Let us assume that the Leave voters and the Remain voters are in fact samples from the same underlying population.
If that is the case, we can throw the Leave and Remain voters into one big pool of 774 + 541 == 1315 voters.
Then we can take split this new mixed sample into two groups, at random, one with 774 voters, and the other with 541. The new groups have a random mix of the original Leave and Remain voters. Then we calculate the difference in means between these two new, fake groups.
pooled = np.append(remain_ages, leave_ages)
pooled
array([37, 55, 37, ..., 20, 40, 31])
len(pooled)
1315
We mix the two samples together, using np.random.shuffle
. It works like
this:
pets = np.array(['cat', 'dog', 'rabbit'])
pets
array(['cat', 'dog', 'rabbit'], dtype='<U6')
np.random.shuffle(pets)
pets
array(['cat', 'rabbit', 'dog'], dtype='<U6')
np.random.shuffle(pets)
pets
array(['cat', 'rabbit', 'dog'], dtype='<U6')
Now to mix up ages of the Leavers and Remainers:
np.random.shuffle(pooled)
pooled
array([54, 50, 19, ..., 52, 80, 92])
We split the newly mixed group into 774 simulated Remain voters and 541 simulated Leave voters, where each group is a random mix of the original Leave and Remain ages.
# The first 774 values
fake_remainers = pooled[:774]
# The rest
fake_leavers = pooled[774:]
len(fake_leavers)
541
Now we can calculate the mean difference. This is our first simulation:
fake_difference = np.mean(fake_leavers) - np.mean(fake_remainers)
fake_difference
0.11033973835417754
That looks a lot smaller than the difference we saw. We want to keep doing this, to collect more simulations. We need to mix up the ages again, to give us new random samples of fake Remainers and fake Leavers.
np.random.shuffle(pooled)
fake_difference_2 = np.mean(pooled[:774]) - np.mean(pooled[774:])
fake_difference_2
-1.5329493186605347
We want to keep doing this - and that calls for a for
loop.