Brexit - the data analysis

We start, as usual, by importing all the libraries we need.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Fancy plots
plt.style.use('fivethirtyeight')

# Data frame library
import pandas as pd

All about the Brexiteers

Every year, the Hansard Society sponsors a survey on political engagement in the UK.

They put topical questions in each survey. For the 2016/2017 survey, they asked how people voted in the Brexit referendum.

Luckily, they make the data freely available online for us to analyze.

You can get the data for yourself from the UK Data Service: https://discover.ukdataservice.ac.uk/catalogue/?sn=8183. There are data files in various formats, including:

  • SPSS format (for the SPSS statistical package);
  • Stata format (for the Stata statistical package);
  • tab-delimited (a general data format, that can be used with Pandas, Excel, and other packages).

The data is in a standard form, with one row per respondent, and one column per question.

To save you a tiny bit of work, I have made an unchanged copy of the tab-delimited version of the data file for you to download directly. I have also made a copy of the document describing the questions they asked and the way that they recorded the answers in the data file. This is often called the “data dictionary”. It was originally in Rich Text Format, but I have converted it to PDF for convenience. It is otherwise identical to the file you will find at the UK Data Service.

You can download these copies from the following links:

If you are running this notebook on your laptop, download the tab-delimited data file to the same directory as the notebook.

In a moment, we are going to try to analyze these data. We will focus on two questions, labeled cut15 and numage. cut15 is the question about Brexit. The data dictionary gives the variable label “CUT15 - How did you vote on the question ‘Should the United Kingdom remain a member of the European Union or leave the European Union’?”. The recorded values run from 1 through 6 and have the following labels:

Value label information for cut15
Value = 1.0    Label = Remain a member of the European Union
Value = 2.0    Label = Leave the European Union
Value = 3.0    Label = Did not vote
Value = 4.0    Label = Too young
Value = 5.0    Label = Can't remember
Value = 6.0    Label = Refused

We also want the variable numage; this is the age of the respondent in years.

The data file that you just downloaded should be called audit_of_political_engagement_14_2017.tab. The cell below loads the data file into memory with Pandas:

# Load the data frame, and put it in the variable "audit_data"
audit_data = pd.read_table('audit_of_political_engagement_14_2017.tab')

As you know, we now have a data frame:

type(audit_data)
pandas.core.frame.DataFrame

The data frame has one row per person surveyed, and one column for each question in the survey. The columns have kind-of helpful names that you can read about in the data dictionary:

audit_data
cu041 cu042 cu043 cu044 cu045 cu046 cu047 cu048 cu049 cu0410 ... intten cx_971_980 serial week wts numage weight0 sgrade_grp age_grp region2
0 0 0 0 0 1 1 0 0 0 0 ... -1 3.41659 1399 648 3.41659 37 3.41659 1 4 3
1 0 0 0 0 0 0 0 0 0 1 ... -1 2.68198 1733 648 2.68198 55 2.68198 2 6 3
2 0 0 0 0 1 0 0 0 0 0 ... -1 0.79379 1736 648 0.79379 71 0.79379 2 7 4
3 0 0 0 0 1 0 1 0 0 0 ... -1 1.40580 1737 648 1.40580 37 1.40580 1 4 4
4 0 0 0 1 1 0 1 0 0 0 ... -1 0.89475 1738 648 0.89475 42 0.89475 2 4 4
5 1 1 0 0 0 0 0 0 0 0 ... -1 3.22535 1801 648 3.22535 0 3.22535 1 1 3
6 0 0 0 0 1 0 0 0 0 0 ... -1 1.52922 1802 648 1.52922 69 1.52922 1 7 3
7 1 0 0 0 1 0 0 0 0 0 ... -1 2.89655 1803 648 2.89655 20 2.89655 2 2 3
8 0 0 0 0 1 0 0 0 0 0 ... -1 4.66393 1804 648 4.66393 38 4.66393 1 4 3
9 1 0 0 1 1 0 1 0 0 0 ... -1 1.43732 1806 648 1.43732 60 1.43732 4 6 3
10 0 0 0 1 0 0 0 0 0 0 ... -1 1.81109 1807 648 1.81109 0 1.81109 3 1 3
11 0 0 0 0 0 0 0 0 0 0 ... -1 4.18632 1808 648 4.18632 32 4.18632 1 3 3
12 0 0 0 0 0 0 0 0 0 0 ... -1 1.81382 1809 648 1.81382 79 1.81382 2 8 3
13 1 0 0 0 0 1 0 0 0 0 ... -1 1.73488 1811 648 1.73488 0 1.73488 4 1 4
14 1 0 0 0 1 1 1 0 0 0 ... -1 1.22631 1812 648 1.22631 58 1.22631 1 6 4
15 0 0 0 1 0 1 0 0 0 0 ... -1 1.73024 1813 648 1.73024 46 1.73024 1 5 4
16 0 0 0 0 0 1 0 0 0 0 ... -1 1.83130 1814 648 1.83130 51 1.83130 1 5 4
17 1 0 0 0 0 0 0 0 0 0 ... -1 0.50172 1815 648 0.50172 74 0.50172 3 7 4
18 0 0 0 0 0 1 0 0 0 0 ... -1 3.33900 1816 648 3.33900 57 3.33900 1 6 3
19 0 0 0 0 0 0 0 0 0 0 ... -1 1.89094 1817 648 1.89094 61 1.89094 3 6 3
20 0 0 0 0 0 0 0 0 0 0 ... -1 3.80239 1818 648 3.80239 47 3.80239 2 5 3
21 0 0 0 0 0 1 0 0 0 0 ... -1 1.55589 1819 648 1.55589 56 1.55589 4 6 3
22 0 0 0 0 0 0 0 0 0 0 ... -1 2.73494 1820 648 2.73494 87 2.73494 2 8 3
23 0 0 0 0 0 0 0 0 0 0 ... -1 3.18552 1821 648 3.18552 76 3.18552 1 8 3
24 0 0 0 0 1 0 0 0 0 0 ... -1 4.38354 1822 648 4.38354 20 4.38354 1 2 3
25 0 0 0 0 0 0 0 0 0 0 ... -1 0.63495 1823 648 0.63495 35 0.63495 3 4 4
26 1 0 0 0 1 1 0 0 0 0 ... -1 0.48390 1824 648 0.48390 28 0.48390 2 3 1
27 0 0 0 0 0 0 0 0 0 0 ... -1 0.85149 1825 648 0.85149 38 0.85149 2 4 1
28 0 0 0 0 0 0 0 0 0 0 ... -1 1.12183 1828 648 1.12183 44 1.12183 3 4 3
29 0 0 0 1 0 0 0 0 0 0 ... -1 2.38711 1829 648 2.38711 38 2.38711 2 4 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1741 0 0 0 0 0 0 0 0 0 0 ... 9 0.45925 3262 649 0.45925 35 0.45925 3 4 4
1742 0 0 0 0 0 0 0 0 0 0 ... 9 0.97481 3284 649 0.97481 39 0.97481 3 4 6
1743 0 0 0 1 1 0 0 0 0 0 ... 9 0.97507 3291 649 0.97507 39 0.97507 1 4 5
1744 0 0 0 0 0 0 0 0 0 0 ... 9 1.81637 3295 649 1.81637 44 1.81637 2 4 5
1745 0 0 0 1 0 1 0 0 0 0 ... 9 0.13786 3329 649 0.13786 18 0.13786 2 2 6
1746 0 0 0 0 0 0 0 0 0 0 ... 9 0.34092 3344 649 0.34092 40 0.34092 1 4 5
1747 0 0 0 0 0 0 0 0 0 0 ... 9 1.09296 3364 649 1.09296 60 1.09296 3 6 4
1748 0 0 0 0 0 0 0 0 0 0 ... 4 1.16371 3367 649 1.16371 36 1.16371 2 4 4
1749 0 0 0 0 0 1 0 0 0 0 ... 9 0.98811 3368 649 0.98811 72 0.98811 4 7 4
1750 0 0 0 0 0 0 0 0 0 0 ... 9 0.36535 3370 649 0.36535 70 0.36535 4 7 6
1751 0 0 0 0 0 0 0 0 0 0 ... 9 0.76167 3372 649 0.76167 31 0.76167 1 3 6
1752 0 0 0 0 0 0 0 0 0 0 ... 4 0.24729 3377 649 0.24729 20 0.24729 4 2 6
1753 0 0 0 0 0 0 0 0 0 0 ... 6 0.17248 3378 649 0.17248 67 0.17248 4 7 6
1754 0 0 0 0 0 0 0 0 0 0 ... 5 0.22854 3380 649 0.22854 54 0.22854 4 5 6
1755 0 0 0 0 0 0 0 0 0 0 ... 9 0.39888 3382 649 0.39888 18 0.39888 4 2 6
1756 0 0 0 0 0 0 0 0 0 0 ... 9 0.17631 3384 649 0.17631 18 0.17631 2 2 6
1757 0 0 0 0 0 0 0 0 0 0 ... 9 0.32814 3386 649 0.32814 24 0.32814 4 2 6
1758 0 0 0 0 0 0 0 0 0 0 ... 9 0.21588 3388 649 0.21588 20 0.21588 2 2 6
1759 1 0 0 0 0 0 0 0 0 0 ... 9 0.61616 3389 649 0.61616 36 0.61616 3 4 6
1760 0 0 0 0 0 0 0 0 0 0 ... 9 0.21250 3390 649 0.21250 42 0.21250 4 4 6
1761 0 0 0 0 1 0 0 0 0 0 ... 9 0.32776 3392 649 0.32776 37 0.32776 1 4 6
1762 0 0 0 0 0 0 0 0 0 0 ... 7 0.27406 3394 649 0.27406 19 0.27406 4 2 6
1763 0 0 0 0 0 0 0 0 0 0 ... 9 0.63120 3399 649 0.63120 36 0.63120 1 4 6
1764 0 0 1 1 0 1 0 0 0 0 ... 9 0.65792 3407 649 0.65792 67 0.65792 4 7 4
1765 0 0 0 0 0 0 0 0 0 0 ... 9 0.54415 3422 649 0.54415 40 0.54415 2 4 4
1766 0 0 0 0 0 0 0 0 0 0 ... 9 0.44339 3423 649 0.44339 39 0.44339 1 4 4
1767 0 0 0 0 0 0 0 0 0 0 ... 9 0.44086 3425 649 0.44086 20 0.44086 3 2 4
1768 0 0 0 0 0 0 0 0 0 0 ... 9 0.32590 3426 649 0.32590 31 0.32590 3 3 4
1769 0 0 0 0 0 0 0 0 0 0 ... 9 0.66970 3427 649 0.66970 47 0.66970 4 5 4
1770 0 0 0 0 0 0 0 0 0 0 ... 9 0.39478 3434 649 0.39478 25 0.39478 4 3 4

1771 rows × 370 columns

The data frame has columns for all the questions listed in the data dictionary:

audit_data.columns
Index(['cu041', 'cu042', 'cu043', 'cu044', 'cu045', 'cu046', 'cu047', 'cu048',
       'cu049', 'cu0410',
       ...
       'intten', 'cx_971_980', 'serial', 'week', 'wts', 'numage', 'weight0',
       'sgrade_grp', 'age_grp', 'region2'],
      dtype='object', length=370)

To reduce clutter, we first make a new data frame that just has the two questions we are interested in. To do this we first make a list with the names of the columns we want:

desired_columns = ["numage", "cut15"]

Then we use this list to make a new data frame that only has the named columns, like this:

# Select the age and Brexit vote questions only
mini_brexit = audit_data[desired_columns]
mini_brexit
numage cut15
0 37 1
1 55 1
2 71 2
3 37 1
4 42 1
5 0 1
6 69 1
7 20 1
8 38 1
9 60 2
10 0 1
11 32 1
12 79 3
13 0 2
14 58 1
15 46 1
16 51 1
17 74 2
18 57 1
19 61 2
20 47 2
21 56 2
22 87 1
23 76 2
24 20 3
25 35 2
26 28 1
27 38 1
28 44 2
29 38 2
... ... ...
1741 35 1
1742 39 3
1743 39 1
1744 44 1
1745 18 4
1746 40 1
1747 60 3
1748 36 3
1749 72 1
1750 70 1
1751 31 3
1752 20 3
1753 67 1
1754 54 2
1755 18 2
1756 18 4
1757 24 3
1758 20 2
1759 36 3
1760 42 3
1761 37 1
1762 19 5
1763 36 6
1764 67 1
1765 40 2
1766 39 1
1767 20 3
1768 31 2
1769 47 3
1770 25 3

1771 rows × 2 columns

To get started on exploring, we make a new variable ages that refers to the numage column in the mini_brexit data frame.

# Make a new variable "ages" that refers to the "numage" column in "mini_brexit"
ages = mini_brexit["numage"]

Confirm that ages has a value of type Series, the Pandas type for a column of a data frame:

type(ages)
pandas.core.series.Series

Here are the numbers of rows and columns in the original data frame:

audit_data.shape
(1771, 370)

Run the cell below to confirm that ages has the same number of values as audit_data has rows. To do this, we apply the len function to the ages Series; it returns the number of values.

len(ages)
1771

In fact, len applied to the data frame returns the number of rows:

len(audit_data)
1771

Start by doing a histogram of the values in ages (which are also the values in the numage column of mini_brexit). If you can’t remember how to do histograms, have a look at the introduction to data frames notebook. Hint: consider using the hist method of the ages variable.

# Do a histogram of the values in the "numage" column.
# Your code here.
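If you get stuck, here is a minimal sketch of one possible solution, using the hist method from the hint above:

# One possible solution: plot the distribution of ages.
ages.hist()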

You will see that a few subjects have an age of 0.

It looks as if the survey coders are using the value 0 to mean that the person did not state their age. We will have to clean that up. We do that by selecting the cases that have ages not equal to 0.

Hint: you have seen the operator that tests whether two values are equal or not:

1 == 2
False
2 == 2
True

The operator for not equal is !=, as in:

1 != 2
True

Prepare for a brain-bending double negative…

2 != 2
False

To identify the values in ages that are not equal to 0, use the comparison hinted at above to make a new variable, age_not_0, that has the same number of values as ages, with True at positions where ages is not equal to 0, and False otherwise. We will refer to these sequences of True and False values as Boolean vectors.

Check back to the introduction to data frames notebook for a reminder of making and using Boolean vectors to select rows from data frames.

# Create new variable "age_not_0", with True at positions where "ages" is not
# equal to 0, and False otherwise.
# age_not_0 = ?
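In case you need it, a minimal sketch of one possible solution:

# One possible solution: True where the age is not 0, False otherwise.
age_not_0 = ages != 0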

Use age_not_0 to select the rows in the mini_brexit data frame where the value is True, and throw away the rows where the value is False. To do this, use the loc attribute of the data frame; it locates rows (here, via our Boolean vector):

# Select rows in the data frame where the age is not equal to 0.  Make a new
# data frame called "good_brexit" that only contains these rows.  Your code
# will start with
# good_brexit = mini_brexit.loc?
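Again, a minimal sketch of one possible solution:

# One possible solution: keep only the rows where the age is not 0.
good_brexit = mini_brexit.loc[age_not_0]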

Now we want to ask what proportion of the respondents said that they voted Remain or Leave.

First we make a new data frame that contains only the rows for people who said they voted Remain in the referendum. Remember, from the data dictionary, that 1 is the code for a Remain vote.

First, make a new variable votes that has the values of cut15 column of the good_brexit data frame.

# Make a new variable "votes" that refers to the "cut15" column in "good_brexit".
# Your code will start with
# votes = ?
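A minimal sketch of one possible solution:

# One possible solution: the "cut15" column of "good_brexit".
votes = good_brexit['cut15']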

Now make a new Boolean vector that has True at the positions where votes is equal to 1, and False otherwise. Call this variable is_remain.

# Make a Boolean vector, called "is_remain", that is True for each Remain row,
# False otherwise.
# is_remain = ?
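For example, as a sketch:

# One possible solution: True where the vote code is 1 (Remain).
is_remain = votes == 1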

Next, use is_remain to select the rows in good_brexit that correspond to confessed “Remain” voters. Call the new data frame remainers:

# Select the rows from "good_brexit" that correspond to Remain voters
# remainers = good_brexit?
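A minimal sketch of one possible solution:

# One possible solution: select the Remain rows with "loc".
remainers = good_brexit.loc[is_remain]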

Do a histogram of the values in the numage column of remainers:

# Show a histogram of the `numage` column from `remainers`
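For example, as a sketch:

# One possible solution: histogram of the Remain voters' ages.
remainers['numage'].hist()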

Next, go through the same steps to make a new data frame for those who said they voted Leave (code 2):

# Make a Boolean vector, called "is_leave", that is True for each Leave row,
# False otherwise.
# is_leave = ?
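A sketch, mirroring the Remain case:

# One possible solution: True where the vote code is 2 (Leave).
is_leave = votes == 2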

Next, use is_leave to select the rows in good_brexit that correspond to confessed “Leave” voters. Call the new data frame leavers:

# Select the rows from "good_brexit" that correspond to Leave voters
# leavers = good_brexit?
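For example, as a sketch:

# One possible solution: select the Leave rows with "loc".
leavers = good_brexit.loc[is_leave]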

Do a histogram of the values in the numage column of leavers:

# Show a histogram of the `numage` column from `leavers`
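For example:

# One possible solution: histogram of the Leave voters' ages.
leavers['numage'].hist()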

Uncomment the lines in the cell below to get the total number of Remain voters:

# n_remain = len(remainers)
# n_remain

Here is the total number of Leave voters:

# n_leave = len(leavers)
# n_leave

Here is the total number of voters who confessed to a specific Leave or Remain vote:

# n_total = n_leave + n_remain
# n_total

Here is the proportion of Leave voters:

# leave_proportion = n_leave / n_total
# leave_proportion

As you remember, the proportion of Leave voters in the referendum was 51.9%. leave_proportion from the survey seems to be some way off. Is it too far off?

You go back to the survey company and tell them that the proportion of Leave voters seems too low.

They say the following:

We took a random sample of the population. You are a data scientist; you know well that the proportion from this random sample is very unlikely to be exactly the same as the proportion in the whole population. The proportion we get is compatible with the variation we expect from taking a random sample.

In other words, the difference in proportions between the referendum and the survey is due to sampling error.

Time for a simulation.

The null hypothesis offered by the survey company is that the proportion we saw above is a plausible value if we took a random sample of n_total voters.

We can simulate a new survey, with n_total voters, by taking n_total random numbers between 0 and 1. We consider values less than 0.519 as corresponding to a Leave vote, and the rest as Remain votes. We then calculate the proportion of Leave votes (the proportion of values where value < 0.519).

We do this 10000 times, to get 10000 simulated surveys. We calculate the proportions for each simulated survey, and do a histogram of the proportions. Is leave_proportion a plausible value on this histogram?

Look back at the earlier notebooks to remind yourself about simulations.

# Your simulation here
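Here is a minimal sketch of one possible simulation. It assumes you have uncommented and run the cells above, so that n_total and leave_proportion are defined:

# A sketch of one possible simulation, assuming "n_total" and
# "leave_proportion" are defined by the cells above.
proportions = np.zeros(10000)
for i in np.arange(10000):
    # One simulated survey: "n_total" random numbers between 0 and 1.
    random_votes = np.random.uniform(size=n_total)
    # Count values below 0.519 as Leave votes; store the proportion.
    proportions[i] = np.count_nonzero(random_votes < 0.519) / n_total
# Histogram of the simulated proportions.  Compare "leave_proportion"
# to the spread of this histogram.
plt.hist(proportions)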