Brexit analysis
Brexit - the data analysis
We start, as usual, by importing all the libraries we need.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Fancy plots
plt.style.use('fivethirtyeight')
# Data frame library
import pandas as pd
See the Brexit survey for details on the survey data.
If you are running on your laptop, first download the data file to the same directory as this notebook.
In the moment, we are going to try and analyze these data. We will focus on
two questions labeled cut15
and numage
. cut15
is the question about
Brexit. The data dictionary has the variable label “CUT15 - How did you vote
on the question ‘Should the United Kingdom remain a member of the European
Union or leave the European Union’?”. The recorded values run from 1 through
6 and have the following labels:
Value label information for cut15
Value = 1.0 Label = Remain a member of the European Union
Value = 2.0 Label = Leave the European Union
Value = 3.0 Label = Did not vote
Value = 4.0 Label = Too young
Value = 5.0 Label = Can't remember
Value = 6.0 Label = Refused
We also want the variable numage
; this is the age of the respondent in years.
The data file that you just downloaded should be called
audit_of_political_engagement_14_2017.tab
. The cell below loads the data
file into memory with Pandas:
# Load the data frame, and put it in the variable "audit_data"
audit_data = pd.read_table('audit_of_political_engagement_14_2017.tab')
As you know, we now have a data frame:
type(audit_data)
pandas.core.frame.DataFrame
The data frame has one row per person surveyed, and one column for each question in the survey. The columns have kind-of helpful names that you can read about in the data dictionary:
audit_data
cu041 | cu042 | cu043 | cu044 | cu045 | cu046 | cu047 | cu048 | cu049 | cu0410 | ... | intten | cx_971_980 | serial | week | wts | numage | weight0 | sgrade_grp | age_grp | region2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 3.41659 | 1399 | 648 | 3.41659 | 37 | 3.41659 | 1 | 4 | 3 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | -1 | 2.68198 | 1733 | 648 | 2.68198 | 55 | 2.68198 | 2 | 6 | 3 |
2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 0.79379 | 1736 | 648 | 0.79379 | 71 | 0.79379 | 2 | 7 | 4 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | -1 | 1.40580 | 1737 | 648 | 1.40580 | 37 | 1.40580 | 1 | 4 | 4 |
4 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | -1 | 0.89475 | 1738 | 648 | 0.89475 | 42 | 0.89475 | 2 | 4 | 4 |
5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 3.22535 | 1801 | 648 | 3.22535 | 0 | 3.22535 | 1 | 1 | 3 |
6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 1.52922 | 1802 | 648 | 1.52922 | 69 | 1.52922 | 1 | 7 | 3 |
7 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 2.89655 | 1803 | 648 | 2.89655 | 20 | 2.89655 | 2 | 2 | 3 |
8 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 4.66393 | 1804 | 648 | 4.66393 | 38 | 4.66393 | 1 | 4 | 3 |
9 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | -1 | 1.43732 | 1806 | 648 | 1.43732 | 60 | 1.43732 | 4 | 6 | 3 |
10 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 1.81109 | 1807 | 648 | 1.81109 | 0 | 1.81109 | 3 | 1 | 3 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 4.18632 | 1808 | 648 | 4.18632 | 32 | 4.18632 | 1 | 3 | 3 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 1.81382 | 1809 | 648 | 1.81382 | 79 | 1.81382 | 2 | 8 | 3 |
13 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 1.73488 | 1811 | 648 | 1.73488 | 0 | 1.73488 | 4 | 1 | 4 |
14 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | -1 | 1.22631 | 1812 | 648 | 1.22631 | 58 | 1.22631 | 1 | 6 | 4 |
15 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 1.73024 | 1813 | 648 | 1.73024 | 46 | 1.73024 | 1 | 5 | 4 |
16 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 1.83130 | 1814 | 648 | 1.83130 | 51 | 1.83130 | 1 | 5 | 4 |
17 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 0.50172 | 1815 | 648 | 0.50172 | 74 | 0.50172 | 3 | 7 | 4 |
18 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 3.33900 | 1816 | 648 | 3.33900 | 57 | 3.33900 | 1 | 6 | 3 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 1.89094 | 1817 | 648 | 1.89094 | 61 | 1.89094 | 3 | 6 | 3 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 3.80239 | 1818 | 648 | 3.80239 | 47 | 3.80239 | 2 | 5 | 3 |
21 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 1.55589 | 1819 | 648 | 1.55589 | 56 | 1.55589 | 4 | 6 | 3 |
22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 2.73494 | 1820 | 648 | 2.73494 | 87 | 2.73494 | 2 | 8 | 3 |
23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 3.18552 | 1821 | 648 | 3.18552 | 76 | 3.18552 | 1 | 8 | 3 |
24 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 4.38354 | 1822 | 648 | 4.38354 | 20 | 4.38354 | 1 | 2 | 3 |
25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 0.63495 | 1823 | 648 | 0.63495 | 35 | 0.63495 | 3 | 4 | 4 |
26 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | -1 | 0.48390 | 1824 | 648 | 0.48390 | 28 | 0.48390 | 2 | 3 | 1 |
27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 0.85149 | 1825 | 648 | 0.85149 | 38 | 0.85149 | 2 | 4 | 1 |
28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 1.12183 | 1828 | 648 | 1.12183 | 44 | 1.12183 | 3 | 4 | 3 |
29 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 2.38711 | 1829 | 648 | 2.38711 | 38 | 2.38711 | 2 | 4 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.45925 | 3262 | 649 | 0.45925 | 35 | 0.45925 | 3 | 4 | 4 |
1742 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.97481 | 3284 | 649 | 0.97481 | 39 | 0.97481 | 3 | 4 | 6 |
1743 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.97507 | 3291 | 649 | 0.97507 | 39 | 0.97507 | 1 | 4 | 5 |
1744 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 1.81637 | 3295 | 649 | 1.81637 | 44 | 1.81637 | 2 | 4 | 5 |
1745 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 9 | 0.13786 | 3329 | 649 | 0.13786 | 18 | 0.13786 | 2 | 2 | 6 |
1746 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.34092 | 3344 | 649 | 0.34092 | 40 | 0.34092 | 1 | 4 | 5 |
1747 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 1.09296 | 3364 | 649 | 1.09296 | 60 | 1.09296 | 3 | 6 | 4 |
1748 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 1.16371 | 3367 | 649 | 1.16371 | 36 | 1.16371 | 2 | 4 | 4 |
1749 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 9 | 0.98811 | 3368 | 649 | 0.98811 | 72 | 0.98811 | 4 | 7 | 4 |
1750 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.36535 | 3370 | 649 | 0.36535 | 70 | 0.36535 | 4 | 7 | 6 |
1751 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.76167 | 3372 | 649 | 0.76167 | 31 | 0.76167 | 1 | 3 | 6 |
1752 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 0.24729 | 3377 | 649 | 0.24729 | 20 | 0.24729 | 4 | 2 | 6 |
1753 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 0.17248 | 3378 | 649 | 0.17248 | 67 | 0.17248 | 4 | 7 | 6 |
1754 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 5 | 0.22854 | 3380 | 649 | 0.22854 | 54 | 0.22854 | 4 | 5 | 6 |
1755 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.39888 | 3382 | 649 | 0.39888 | 18 | 0.39888 | 4 | 2 | 6 |
1756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.17631 | 3384 | 649 | 0.17631 | 18 | 0.17631 | 2 | 2 | 6 |
1757 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.32814 | 3386 | 649 | 0.32814 | 24 | 0.32814 | 4 | 2 | 6 |
1758 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.21588 | 3388 | 649 | 0.21588 | 20 | 0.21588 | 2 | 2 | 6 |
1759 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.61616 | 3389 | 649 | 0.61616 | 36 | 0.61616 | 3 | 4 | 6 |
1760 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.21250 | 3390 | 649 | 0.21250 | 42 | 0.21250 | 4 | 4 | 6 |
1761 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.32776 | 3392 | 649 | 0.32776 | 37 | 0.32776 | 1 | 4 | 6 |
1762 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 0.27406 | 3394 | 649 | 0.27406 | 19 | 0.27406 | 4 | 2 | 6 |
1763 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.63120 | 3399 | 649 | 0.63120 | 36 | 0.63120 | 1 | 4 | 6 |
1764 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 9 | 0.65792 | 3407 | 649 | 0.65792 | 67 | 0.65792 | 4 | 7 | 4 |
1765 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.54415 | 3422 | 649 | 0.54415 | 40 | 0.54415 | 2 | 4 | 4 |
1766 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.44339 | 3423 | 649 | 0.44339 | 39 | 0.44339 | 1 | 4 | 4 |
1767 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.44086 | 3425 | 649 | 0.44086 | 20 | 0.44086 | 3 | 2 | 4 |
1768 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.32590 | 3426 | 649 | 0.32590 | 31 | 0.32590 | 3 | 3 | 4 |
1769 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.66970 | 3427 | 649 | 0.66970 | 47 | 0.66970 | 4 | 5 | 4 |
1770 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 0.39478 | 3434 | 649 | 0.39478 | 25 | 0.39478 | 4 | 3 | 4 |
1771 rows × 370 columns
The data frame has columns for all the questions listed in the data dictionary:
audit_data.columns
Index(['cu041', 'cu042', 'cu043', 'cu044', 'cu045', 'cu046', 'cu047', 'cu048',
'cu049', 'cu0410',
...
'intten', 'cx_971_980', 'serial', 'week', 'wts', 'numage', 'weight0',
'sgrade_grp', 'age_grp', 'region2'],
dtype='object', length=370)
To reduce clutter, we first make a new data frame that just has the two questions we are interested in:
# Select the age and Brexit vote questions only
mini_brexit = pd.DataFrame()
mini_brexit['numage'] = audit_data['numage']
mini_brexit['cut15'] = audit_data['cut15']
mini_brexit
numage | cut15 | |
---|---|---|
0 | 37 | 1 |
1 | 55 | 1 |
2 | 71 | 2 |
3 | 37 | 1 |
4 | 42 | 1 |
5 | 0 | 1 |
6 | 69 | 1 |
7 | 20 | 1 |
8 | 38 | 1 |
9 | 60 | 2 |
10 | 0 | 1 |
11 | 32 | 1 |
12 | 79 | 3 |
13 | 0 | 2 |
14 | 58 | 1 |
15 | 46 | 1 |
16 | 51 | 1 |
17 | 74 | 2 |
18 | 57 | 1 |
19 | 61 | 2 |
20 | 47 | 2 |
21 | 56 | 2 |
22 | 87 | 1 |
23 | 76 | 2 |
24 | 20 | 3 |
25 | 35 | 2 |
26 | 28 | 1 |
27 | 38 | 1 |
28 | 44 | 2 |
29 | 38 | 2 |
... | ... | ... |
1741 | 35 | 1 |
1742 | 39 | 3 |
1743 | 39 | 1 |
1744 | 44 | 1 |
1745 | 18 | 4 |
1746 | 40 | 1 |
1747 | 60 | 3 |
1748 | 36 | 3 |
1749 | 72 | 1 |
1750 | 70 | 1 |
1751 | 31 | 3 |
1752 | 20 | 3 |
1753 | 67 | 1 |
1754 | 54 | 2 |
1755 | 18 | 2 |
1756 | 18 | 4 |
1757 | 24 | 3 |
1758 | 20 | 2 |
1759 | 36 | 3 |
1760 | 42 | 3 |
1761 | 37 | 1 |
1762 | 19 | 5 |
1763 | 36 | 6 |
1764 | 67 | 1 |
1765 | 40 | 2 |
1766 | 39 | 1 |
1767 | 20 | 3 |
1768 | 31 | 2 |
1769 | 47 | 3 |
1770 | 25 | 3 |
1771 rows × 2 columns
To get started on exploring, we make a new variable ages
that refers
to the numage
column in the mini_brexit
data frame.
# Make a new variable "ages" that refers to the "numage" column in "mini_brexit"
ages = mini_brexit["numage"]
Confirm that ages
has a value of type Series
, the Pandas type for a column of a data frame:
type(ages)
pandas.core.series.Series
Here are the numbers of rows, columns in the original data frame:
audit_data.shape
(1771, 370)
Run the cell below to confirm that ages
has the same number of values, as
audit_data
has rows. To do this, we can use the len
function, as applied
to the ages
Series. It returns the number of values.
len(ages)
1771
In fact, len
, as applied to the data frame, returns the number of rows:
len(audit_data)
1771
Start by doing a histogram of the values in ages
(which are also the values
in the numage
column of mini_brexit
). If you can’t remember how to do
histograms, have a look at the introduction to data
frames notebook. Hint: consider using the hist
method of the ages
variable.
# Do a histogram of the values in the "numage" column.
# Your code here.
You will see that a few subjects have an age of 0.
It looks as if the survey coders are using the value 0 to mean that the person did not state their age. We will have to clean that up. We do that by selecting the cases that have ages not equal to 0.
Hint: You have seen the operator to say whether two values are equal or no:
1 == 2
False
2 == 2
True
The operator for not equal is !=
, as in:
1 != 2
True
Prepare for brain-bending double negative…
2 != 2
False
To identify the values in ages
that are not equal to 0, use the comparison
I’ve hinted at above, to make a new variable, age_not_0
, that has the same
number of values as ages
, and has True
at positions where ages
is not
equal to 0, and False
otherwise. We will refer to these sequences of True
and False values, as Boolean vectors.
Check back to the introduction to data frames notebook for a reminder of making and using Boolean vectors to select rows from data frames.
# Create new variable "age_not_0", with True at positions where "ages" is not
# equal to 0, and False otherwise.
# age_not_0 = ?
Use age_not_0
to select rows in the mini_brexit
data frame where the value
is True
, and throw away the rows where the value is False
. To do this, index into the data frame with the Boolean Series you created above.
# Select rows in the data frame where the age is not equal to 0.
# Make a new data # frame called "good_brexit" that only contains these rows.
# good_brexit = ...
Now we want to ask what proportion of the respondents said that they voted Remain or Leave.
First we make a new data frame that contains only the rows for people who said they voted No in the referendum (remain). Remember, from the data dictionary, that 1 is the code for a No vote.
First, make a new variable votes
that has the values of cut15
column of the
good_brexit
data frame.
# Make a new variable "votes" that refers to the "cut15" column in "good_brexit".
# Your code will start with
# votes = ?
Now make a new Boolean vector, that has True at the positions where votes
is
equal to 1, and False otherwise. Call this variable is_remain
.
# Make a Boolean vector, called "is_remain", that True for Remain row, False
# otherwise.
# is_remain = ?
Next, use is_remain
to select the rows in good_brexit
that correspond to
confessed “Remain” voters. Call the new data frame remainers
:
# Select the rows from "good_brexit" that correspond to Remain voters
# remainers = good_brexit?
Do a histogram of the values in the numage
column of remainers
:
# Show a histogram of the `numage` column from `remainers`
Next, go through the same steps, to make a new data frame for those who claimed to vote Yes (leave) (code 2):
# Make a Boolean vector, called "is_leave", that True for Leave row, False
# otherwise.
# is_leave = ?
Next, use is_leave
to select the rows in good_brexit
that correspond to confessed “Leave” voters. Call the new data frame leavers
:
# Select the rows from "good_brexit" that correspond to Leave voters
# leavers = good_brexit?
Do a histogram of the values in the numage
column of leavers
:
# Show a histgram of the `numage` column from `remainers`
Uncomment the lines in the cell below to get the total number of Remain voters:
# n_remain = len(remainers)
# n_remain
Here is the total number of Leave voters:
# n_leave = len(leavers)
# n_leave
Here is the total number of voters who confessed to a specific Leave or Remain vote:
# n_total = n_leave + n_remain
# n_total
Here is the proportion of Leave voters:
# leave_proportion = n_leave / n_total
# leave_proportion
As you remember, the proportion of Leave voters in the referendum was 51.9%.
leave_proportion
from the survey seems a way off. Is it too far off?
You go back to the survey company and tell them that the proportion of Leave voters seems too low.
They say the following:
We took a random sample of the population. You are a data scientist, you know well that the proportion from this random sample is very unlikely to be exactly the same as the proportion in the whole population. The proportion we get is compatible with the variation we expect from taking a random sample.
In other words - the difference in the proportions, between the referendum and the survey, is due to sampling error.
Time for a simulation.
The null hypothesis offered by the survey company is that the proportion we saw
above is a plausible value if we took a random sample of n_total
voters.
We can simulate a new survey, with n_total
voters, by taking n_total
random
numbers between 0 and 1. We consider the values less than 0.52 as
corresponding to a Leave vote, and the rest are Remain votes. We then
calculate the proportion of Leave votes (proportion of values where value <
0.519 == True).
We do this 10000 times, to get 10000 simulated surveys. We calculate the
proportions for each simulated survey, and do a histogram of the proportions.
Is leave_proportion
a plausible value on this histogram?
See:
to remind yourself about simulations.
# Your simulation here