Brexit analysis

Download notebook Interact

Brexit - the data analysis

We start, as usual, by importing all the libraries we need.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Fancy plots
plt.style.use('fivethirtyeight')

# Data frame library
import pandas as pd

All about the Brexiteers

Every year, the Hansard Society sponsors a survey on political engagement in the UK.

They put topical questions in each survey. For the 2016 / 7 survey, they asked about how people voted in the Brexit referendum.

Luckily, they make the data freely available online for us to analyze.

You can get the data for yourself from the UK Data Service: https://discover.ukdataservice.ac.uk/catalogue/?sn=8183. There are data files in various formats, including:

SPSS format (for the SPSS statistical package);
Stata format (for the Stata statistical package);
tab-delimited (a general data format, that can be used with Pandas, Excel, and other packages).

The data is in a standard form, with one row per respondent, and one column per question.

To save you a tiny bit of work, I have made an unchanged copy of the tab-delimited version of the data file for you to download directly. I have also made a copy of the document describing the questions they ask and the way that they have recorded the answers in the data file. This is often called the “data dictionary”. It was originally in Rich Text Format, but I have converted to PDF for convenience. It is otherwise identical to the file you will find at the UK Data Service.

You can download these copies from the following links:

If you are running this notebook on your laptop, download the tab-delimited data file to the same directory as the notebook.

In the moment, we are going to try and analyze these data. We will focus on two questions labeled cut15 and numage. cut15 is the question about Brexit. The data dictionary has the variable label “CUT15 - How did you vote on the question ‘Should the United Kingdom remain a member of the European Union or leave the European Union’?”. The recorded values run from 1 through 6 and have the following labels:

Value label information for cut15
Value = 1.0    Label = Remain a member of the European Union
Value = 2.0    Label = Leave the European Union
Value = 3.0    Label = Did not vote
Value = 4.0    Label = Too young
Value = 5.0    Label = Can't remember
Value = 6.0    Label = Refused

We also want the variable numage; this is the age of the respondent in years.

The data file that you just downloaded should be called audit_of_political_engagement_14_2017.tab. The cell below loads the data file into memory with Pandas:

# Load the data frame, and put it in the variable "audit_data"
audit_data = pd.read_table('audit_of_political_engagement_14_2017.tab')

As you know, we now have a data frame:

type(audit_data)

pandas.core.frame.DataFrame

The data frame has one row per person surveyed, and one column for each question in the survey. The columns have kind-of helpful names that you can read about in the data dictionary:

audit_data

	cu041	cu042	cu043	cu044	cu045	cu046	cu047	cu048	cu049	cu0410	...	intten	cx_971_980	serial	week	wts	numage	weight0	sgrade_grp	age_grp	region2
0	0	0	0	0	1	1	0	0	0	0	...	-1	3.41659	1399	648	3.41659	37	3.41659	1	4	3
1	0	0	0	0	0	0	0	0	0	1	...	-1	2.68198	1733	648	2.68198	55	2.68198	2	6	3
2	0	0	0	0	1	0	0	0	0	0	...	-1	0.79379	1736	648	0.79379	71	0.79379	2	7	4
3	0	0	0	0	1	0	1	0	0	0	...	-1	1.40580	1737	648	1.40580	37	1.40580	1	4	4
4	0	0	0	1	1	0	1	0	0	0	...	-1	0.89475	1738	648	0.89475	42	0.89475	2	4	4
5	1	1	0	0	0	0	0	0	0	0	...	-1	3.22535	1801	648	3.22535	0	3.22535	1	1	3
6	0	0	0	0	1	0	0	0	0	0	...	-1	1.52922	1802	648	1.52922	69	1.52922	1	7	3
7	1	0	0	0	1	0	0	0	0	0	...	-1	2.89655	1803	648	2.89655	20	2.89655	2	2	3
8	0	0	0	0	1	0	0	0	0	0	...	-1	4.66393	1804	648	4.66393	38	4.66393	1	4	3
9	1	0	0	1	1	0	1	0	0	0	...	-1	1.43732	1806	648	1.43732	60	1.43732	4	6	3
10	0	0	0	1	0	0	0	0	0	0	...	-1	1.81109	1807	648	1.81109	0	1.81109	3	1	3
11	0	0	0	0	0	0	0	0	0	0	...	-1	4.18632	1808	648	4.18632	32	4.18632	1	3	3
12	0	0	0	0	0	0	0	0	0	0	...	-1	1.81382	1809	648	1.81382	79	1.81382	2	8	3
13	1	0	0	0	0	1	0	0	0	0	...	-1	1.73488	1811	648	1.73488	0	1.73488	4	1	4
14	1	0	0	0	1	1	1	0	0	0	...	-1	1.22631	1812	648	1.22631	58	1.22631	1	6	4
15	0	0	0	1	0	1	0	0	0	0	...	-1	1.73024	1813	648	1.73024	46	1.73024	1	5	4
16	0	0	0	0	0	1	0	0	0	0	...	-1	1.83130	1814	648	1.83130	51	1.83130	1	5	4
17	1	0	0	0	0	0	0	0	0	0	...	-1	0.50172	1815	648	0.50172	74	0.50172	3	7	4
18	0	0	0	0	0	1	0	0	0	0	...	-1	3.33900	1816	648	3.33900	57	3.33900	1	6	3
19	0	0	0	0	0	0	0	0	0	0	...	-1	1.89094	1817	648	1.89094	61	1.89094	3	6	3
20	0	0	0	0	0	0	0	0	0	0	...	-1	3.80239	1818	648	3.80239	47	3.80239	2	5	3
21	0	0	0	0	0	1	0	0	0	0	...	-1	1.55589	1819	648	1.55589	56	1.55589	4	6	3
22	0	0	0	0	0	0	0	0	0	0	...	-1	2.73494	1820	648	2.73494	87	2.73494	2	8	3
23	0	0	0	0	0	0	0	0	0	0	...	-1	3.18552	1821	648	3.18552	76	3.18552	1	8	3
24	0	0	0	0	1	0	0	0	0	0	...	-1	4.38354	1822	648	4.38354	20	4.38354	1	2	3
25	0	0	0	0	0	0	0	0	0	0	...	-1	0.63495	1823	648	0.63495	35	0.63495	3	4	4
26	1	0	0	0	1	1	0	0	0	0	...	-1	0.48390	1824	648	0.48390	28	0.48390	2	3	1
27	0	0	0	0	0	0	0	0	0	0	...	-1	0.85149	1825	648	0.85149	38	0.85149	2	4	1
28	0	0	0	0	0	0	0	0	0	0	...	-1	1.12183	1828	648	1.12183	44	1.12183	3	4	3
29	0	0	0	1	0	0	0	0	0	0	...	-1	2.38711	1829	648	2.38711	38	2.38711	2	4	3
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1741	0	0	0	0	0	0	0	0	0	0	...	9	0.45925	3262	649	0.45925	35	0.45925	3	4	4
1742	0	0	0	0	0	0	0	0	0	0	...	9	0.97481	3284	649	0.97481	39	0.97481	3	4	6
1743	0	0	0	1	1	0	0	0	0	0	...	9	0.97507	3291	649	0.97507	39	0.97507	1	4	5
1744	0	0	0	0	0	0	0	0	0	0	...	9	1.81637	3295	649	1.81637	44	1.81637	2	4	5
1745	0	0	0	1	0	1	0	0	0	0	...	9	0.13786	3329	649	0.13786	18	0.13786	2	2	6
1746	0	0	0	0	0	0	0	0	0	0	...	9	0.34092	3344	649	0.34092	40	0.34092	1	4	5
1747	0	0	0	0	0	0	0	0	0	0	...	9	1.09296	3364	649	1.09296	60	1.09296	3	6	4
1748	0	0	0	0	0	0	0	0	0	0	...	4	1.16371	3367	649	1.16371	36	1.16371	2	4	4
1749	0	0	0	0	0	1	0	0	0	0	...	9	0.98811	3368	649	0.98811	72	0.98811	4	7	4
1750	0	0	0	0	0	0	0	0	0	0	...	9	0.36535	3370	649	0.36535	70	0.36535	4	7	6
1751	0	0	0	0	0	0	0	0	0	0	...	9	0.76167	3372	649	0.76167	31	0.76167	1	3	6
1752	0	0	0	0	0	0	0	0	0	0	...	4	0.24729	3377	649	0.24729	20	0.24729	4	2	6
1753	0	0	0	0	0	0	0	0	0	0	...	6	0.17248	3378	649	0.17248	67	0.17248	4	7	6
1754	0	0	0	0	0	0	0	0	0	0	...	5	0.22854	3380	649	0.22854	54	0.22854	4	5	6
1755	0	0	0	0	0	0	0	0	0	0	...	9	0.39888	3382	649	0.39888	18	0.39888	4	2	6
1756	0	0	0	0	0	0	0	0	0	0	...	9	0.17631	3384	649	0.17631	18	0.17631	2	2	6
1757	0	0	0	0	0	0	0	0	0	0	...	9	0.32814	3386	649	0.32814	24	0.32814	4	2	6
1758	0	0	0	0	0	0	0	0	0	0	...	9	0.21588	3388	649	0.21588	20	0.21588	2	2	6
1759	1	0	0	0	0	0	0	0	0	0	...	9	0.61616	3389	649	0.61616	36	0.61616	3	4	6
1760	0	0	0	0	0	0	0	0	0	0	...	9	0.21250	3390	649	0.21250	42	0.21250	4	4	6
1761	0	0	0	0	1	0	0	0	0	0	...	9	0.32776	3392	649	0.32776	37	0.32776	1	4	6
1762	0	0	0	0	0	0	0	0	0	0	...	7	0.27406	3394	649	0.27406	19	0.27406	4	2	6
1763	0	0	0	0	0	0	0	0	0	0	...	9	0.63120	3399	649	0.63120	36	0.63120	1	4	6
1764	0	0	1	1	0	1	0	0	0	0	...	9	0.65792	3407	649	0.65792	67	0.65792	4	7	4
1765	0	0	0	0	0	0	0	0	0	0	...	9	0.54415	3422	649	0.54415	40	0.54415	2	4	4
1766	0	0	0	0	0	0	0	0	0	0	...	9	0.44339	3423	649	0.44339	39	0.44339	1	4	4
1767	0	0	0	0	0	0	0	0	0	0	...	9	0.44086	3425	649	0.44086	20	0.44086	3	2	4
1768	0	0	0	0	0	0	0	0	0	0	...	9	0.32590	3426	649	0.32590	31	0.32590	3	3	4
1769	0	0	0	0	0	0	0	0	0	0	...	9	0.66970	3427	649	0.66970	47	0.66970	4	5	4
1770	0	0	0	0	0	0	0	0	0	0	...	9	0.39478	3434	649	0.39478	25	0.39478	4	3	4

1771 rows × 370 columns

The data frame has columns for all the questions listed in the data dictionary:

audit_data.columns

Index(['cu041', 'cu042', 'cu043', 'cu044', 'cu045', 'cu046', 'cu047', 'cu048',
       'cu049', 'cu0410',
       ...
       'intten', 'cx_971_980', 'serial', 'week', 'wts', 'numage', 'weight0',
       'sgrade_grp', 'age_grp', 'region2'],
      dtype='object', length=370)

To reduce clutter, we first make a new data frame that just has the two questions we are interested in. To do this we first make a list with the names of the columns we want:

desired_columns = ["numage", "cut15"]

Then we use this list, to make a new data frame, that only has the named columns, like this:

# Select the age and Brexit vote questions only
mini_brexit = audit_data[['numage', 'cut15']]
mini_brexit

	numage	cut15
0	37	1
1	55	1
2	71	2
3	37	1
4	42	1
5	0	1
6	69	1
7	20	1
8	38	1
9	60	2
10	0	1
11	32	1
12	79	3
13	0	2
14	58	1
15	46	1
16	51	1
17	74	2
18	57	1
19	61	2
20	47	2
21	56	2
22	87	1
23	76	2
24	20	3
25	35	2
26	28	1
27	38	1
28	44	2
29	38	2
...	...	...
1741	35	1
1742	39	3
1743	39	1
1744	44	1
1745	18	4
1746	40	1
1747	60	3
1748	36	3
1749	72	1
1750	70	1
1751	31	3
1752	20	3
1753	67	1
1754	54	2
1755	18	2
1756	18	4
1757	24	3
1758	20	2
1759	36	3
1760	42	3
1761	37	1
1762	19	5
1763	36	6
1764	67	1
1765	40	2
1766	39	1
1767	20	3
1768	31	2
1769	47	3
1770	25	3

1771 rows × 2 columns

To get started on exploring, we make a new variable ages that refers to the numage column in the mini_brexit data frame.

# Make a new variable "ages" that refers to the "numage" column in "mini_brexit"
ages = mini_brexit["numage"]

Confirm that ages has a value of type Series, the Pandas type for a column of a data frame:

type(ages)

pandas.core.series.Series

Here are the numbers of rows, columns in the original data frame:

audit_data.shape

(1771, 370)

Run the cell below to confirm that ages has the same number of values, as audit_data has rows. To do this, we can use the len function, as applied to the ages Series. It returns the number of values.

len(ages)

In fact, len, as applied to the data frame, returns the number of rows:

len(audit_data)

Start by doing a histogram of the values in ages (which are also the values in the numage column of mini_brexit). If you can’t remember how to do histograms, have a look at the introduction to data frames notebook. Hint: consider using the hist method of the ages variable.

# Do a histogram of the values in the "numage" column.
# Your code here.

You will see that a few subjects have an age of 0.

It looks as if the survey coders are using the value 0 to mean that the person did not state their age. We will have to clean that up. We do that by selecting the cases that have ages not equal to 0.

Hint: You have seen the operator to say whether two values are equal or no:

1 == 2

False

2 == 2

True

The operator for not equal is !=, as in:

1 != 2

True

Prepare for brain-bending double negative…

2 != 2

False

To identify the values in ages that are not equal to 0, use the comparison I’ve hinted at above, to make a new variable, age_not_0, that has the same number of values as ages, and has True at positions where ages is not equal to 0, and False otherwise. We will refer to these sequences of True and False values, as Boolean vectors.

Check back to the introduction to data frames notebook for a reminder of making and using Boolean vectors to select rows from data frames.

# Create new variable "age_not_0", with True at positions where "ages" is not
# equal to 0, and False otherwise.
# age_not_0 = ?

Use age_not_0 to select rows in the mini_brexit data frame where the value is True, and throw away the rows where the value is False. To do this, use the loc function attached to the data frame. It locates values:

# Select rows in the data frame where the age is not equal to 0 Make a new data
# frame called "good_brexit" that only contains these rows.  Your code will start
# good_brexit = mini_brexit.loc?

Now we want to ask what proportion of the respondents said that they voted Remain or Leave.

First we make a new data frame that contains only the rows for people who said they voted No in the referendum (remain). Remember, from the data dictionary, that 1 is the code for a No vote.

First, make a new variable votes that has the values of cut15 column of the good_brexit data frame.

# Make a new variable "votes" that refers to the "cut15" column in "good_brexit".
# Your code will start with
# votes = ?

Now make a new Boolean vector, that has True at the positions where votes is equal to 1, and False otherwise. Call this variable is_remain.

# Make a Boolean vector, called "is_remain", that True for Remain row, False
# otherwise.
# is_remain = ?

Next, use is_remain to select the rows in good_brexit that correspond to confessed “Remain” voters. Call the new data frame remainers:

# Select the rows from "good_brexit" that correspond to Remain voters
# remainers = good_brexit?

Do a histogram of the values in the numage column of remainers:

# Show a histogram of the `numage` column from `remainers`

Next, go through the same steps, to make a new data frame for those who claimed to vote Yes (leave) (code 2):

# Make a Boolean vector, called "is_leave", that True for Leave row, False
# otherwise.
# is_leave = ?

Next, use is_leave to select the rows in good_brexit that correspond to confessed “Leave” voters. Call the new data frame leavers:

# Select the rows from "good_brexit" that correspond to Leave voters
# leavers = good_brexit?

Do a histogram of the values in the numage column of leavers:

# Show a histgram of the `numage` column from `remainers`

Uncomment the lines in the cell below to get the total number of Remain voters:

# n_remain = len(remainers)
# n_remain

Here is the total number of Leave voters:

# n_leave = len(leavers)
# n_leave

Here is the total number of voters who confessed to a specific Leave or Remain vote:

# n_total = n_leave + n_remain
# n_total

Here is the proportion of Leave voters:

# leave_proportion = n_leave / n_total
# leave_proportion

As you remember, the proportion of Leave voters in the referendum was 51.9%. leave_proportion from the survey seems a way off. Is it too far off?

You go back to the survey company and tell them that the proportion of Leave voters seems too low.

They say the following:

We took a random sample of the population. You are a data scientist, you know well that the proportion from this random sample is very unlikely to be exactly the same as the proportion in the whole population. The proportion we get is compatible with the variation we expect from taking a random sample.

In other words - the difference in the proportions, between the referendum and the survey, is due to sampling error.

Time for a simulation.

The null hypothesis offered by the survey company is that the proportion we saw above is a plausible value if we took a random sample of n_total voters.

We can simulate a new survey, with n_total voters, by taking n_total random numbers between 0 and 1. We consider the values less than 0.52 as corresponding to a Leave vote, and the rest are Remain votes. We then calculate the proportion of Leave votes (proportion of values where value < 0.519 == True).

We do this 10000 times, to get 10000 simulated surveys. We calculate the proportions for each simulated survey, and do a histogram of the proportions. Is leave_proportion a plausible value on this histogram?

See:

to remind yourself about simulations.

# Your simulation here