
Course outline

Introduction

See the talk slides [1].

The first problem - a Brexit survey

First we analyze a survey of British voters after Brexit - see Analyzing Brexit.

We find that, of respondents who were prepared to say how they voted in the Brexit referendum, 41% said they voted Leave, and 59% said they voted Remain.

But we know that 52% of UK voters voted Leave in the referendum.

Let’s say we have complained about this discrepancy to the survey company, and they say that the difference is due to “sampling error”. They mean that they took a random sample of UK voters but, just by chance, that sample had rather fewer Leave voters than the whole UK population.

That’s their claim. But can that really be true?

Solving with simulation

We need to think about what random sampling means.

To start, let’s think about a simpler problem.

Here’s the problem:

If a family has 4 children, what is the probability that the family has 3 girls?

We can solve this by drawing out the different possibilities in a probability tree, but then we have to do some calculations with probability. These can be difficult, and easy to get wrong, especially in harder cases than this.

Another way to solve it is with simulation.

Let’s say each child has a 50% chance of being a girl. We can simulate a birth with the toss of a coin.

We can simulate four children with four coin tosses. Call this a single trial. If we repeat the trial 100 times, we can count the number of trials that have exactly 3 heads, and divide by 100, to estimate the probability of three girls.

The algorithm is something like:

Set the success counter to 0
Repeat the following 100 times:
    Repeat the following 4 times:
        Toss a coin, record Heads or Tails
    If the number of Heads equals 3:
        Increase the success counter by 1
Show success counter divided by 100

We do that with actual coin tosses, and get an answer close to 25 / 100 - a probability of 0.25.
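For comparison, here is the exact answer by counting. There are 16 equally likely sequences of four births, and 4 of them have exactly three girls (the one boy can be first, second, third or fourth). Assuming, as above, that each birth is a girl with probability 0.5, independently of the others:

```latex
\[
P(\text{3 girls}) = \binom{4}{3} \left(\frac{1}{2}\right)^{3} \left(\frac{1}{2}\right)^{1} = \frac{4}{16} = 0.25
\]
```

So an estimate near 25 / 100 from the coin tosses is what we should expect.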

This all sounds like something the computer would be good at. Let’s do the same simulation using some code.

First we need:

Now we can do the same simulation, with code - see Three girls in a family of four.
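As a rough sketch of what the algorithm above might look like in Python, using only the standard library `random` module. The function names `count_girls` and `estimate_three_girls` are made up for this illustration, and the linked page may well do it differently.

```python
import random


def count_girls(n_children=4, p_girl=0.5, rng=random):
    """Simulate one family (one trial); return the number of girls."""
    # Each "coin toss" is a random number in [0, 1); below p_girl counts as a girl.
    return sum(rng.random() < p_girl for _ in range(n_children))


def estimate_three_girls(n_trials=100, rng=random):
    """Estimate the probability of exactly 3 girls in a family of 4."""
    successes = 0  # Set the success counter to 0
    for _ in range(n_trials):  # Repeat the trial n_trials times
        if count_girls(rng=rng) == 3:
            successes += 1
    return successes / n_trials


# A seeded generator makes the run repeatable; the seed value is arbitrary.
rng = random.Random(2024)
print(estimate_three_girls(100, rng))      # 100 trials, as with the real coins
print(estimate_three_girls(100_000, rng))  # many more trials, less noisy estimate
```

With only 100 trials the estimate bounces around from run to run. With many more trials it should settle close to 0.25.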

Brexit sampling by simulation

We go back to the problem of “sampling error” in the Brexit survey.

1315 respondents said how they voted in the referendum, and 541 said they voted Leave - about 41%. This is well off the final UK result of 52%.

When the survey company claim this is sampling error, they are saying that their random sample of UK voters just happened to pick up a smaller proportion of Leave voters.

In statistical terms, everyone who voted in the referendum is the population. We know that, in this population, 52% voted Leave. The 1315 survey respondents who said how they voted are a sample from that population.

If that sample is random, so that every UK voter had an equal chance of being in that sample of 1315, then, for each respondent, there is a 52% chance that they voted Leave.

So, if the chance that any one voter voted Leave is 52%, what are the chances that, if we take a sample of 1315 voters, only 41% voted Leave?

This is the same problem as the problem of three girls in a family of four, except:

  • our coin toss gave a 50% chance of being a girl, but our voter has a slightly greater chance (52%) of being a Leave voter and
  • we have 1315 voters in our trial instead of four children.

We are going to use the computer to do many thousands of trials, each with 1315 simulated voters. Each trial will give a proportion of Leave voters. For each trial, the proportion will be slightly different, because we will get more or fewer Leave voters by chance, on each trial.

If we collect the proportion from each of these trials, we can build up a distribution of the proportions that we would expect when taking these 1315 voter trials.

This is the sampling distribution of the proportion - that is, the distribution of the proportions we expect to get, when we take many samples of size 1315.

Where does 41% fall in this distribution? Is it similar to a reasonable number of the values we get by simulation, or is it way off?

See Testing Brexit Proportions.
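Here is a minimal sketch of that simulation, using only the standard library. The names `leave_proportion` and `sampling_distribution` are invented for this example, and 2000 trials is an arbitrary choice to keep the run short. The linked page may take a different approach.

```python
import random

N_VOTERS = 1315        # respondents who said how they voted
P_LEAVE = 0.52         # proportion voting Leave in the whole population
OBSERVED = 541 / 1315  # proportion of Leave voters in the survey, about 0.41


def leave_proportion(n_voters, p_leave, rng):
    """One trial: simulate n_voters voters, return the proportion voting Leave."""
    n_leave = sum(rng.random() < p_leave for _ in range(n_voters))
    return n_leave / n_voters


def sampling_distribution(n_trials, n_voters=N_VOTERS, p_leave=P_LEAVE, seed=None):
    """Collect the Leave proportion from each of n_trials simulated samples."""
    rng = random.Random(seed)
    return [leave_proportion(n_voters, p_leave, rng) for _ in range(n_trials)]


proportions = sampling_distribution(2000, seed=1)
# How many simulated samples had a Leave proportion as low as the survey's?
n_as_low = sum(p <= OBSERVED for p in proportions)
print("Lowest and highest simulated proportion:", min(proportions), max(proportions))
print("Proportion of trials as low as the survey:", n_as_low / len(proportions))
```

If very few or none of the simulated proportions are as low as 0.41, then "sampling error" is not a convincing explanation for the survey result.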

As a side note - how does this sampling distribution change, as we have more or fewer voters in our sample? Have a look at Spread of a distribution with number of samples.
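One quick way to explore this question is to repeat the simulation for a few sample sizes and compare the spread of the proportions. The sketch below measures spread with the standard deviation. The function name `proportion_spread` and the sample sizes 50, 200 and 800 are arbitrary choices for illustration.

```python
import random
from statistics import stdev

P_LEAVE = 0.52  # chance that any one simulated voter voted Leave


def proportion_spread(n_voters, n_trials=1000, p_leave=P_LEAVE, seed=None):
    """Standard deviation of the Leave proportion across n_trials samples."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_trials):
        n_leave = sum(rng.random() < p_leave for _ in range(n_voters))
        proportions.append(n_leave / n_voters)
    return stdev(proportions)


# Each sample size is four times the one before.
for n_voters in (50, 200, 800):
    print(n_voters, round(proportion_spread(n_voters, seed=1), 4))
```

The spread should shrink as the sample gets larger. Each four-fold increase in sample size should roughly halve the standard deviation.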

Comparing two groups and permutation

Looking again at Analyzing Brexit - we see that the age distribution of Leave voters looks different from the age distribution of Remain voters. For example, the mean age of the Leave voters is higher than the mean age of the Remain voters.

Could this be sampling error too? Could it be true that, in all UK referendum voters, the mean age of the Leave voters was more or less the same as the mean age of the Remain voters? Could it be that, in our random sample, we just happen to have some older Leave voters or younger Remain voters?

What we would like to do is take many samples of 1315 voters, where 541 voted Leave and 774 voted Remain. For each sample, we would take the mean age of each set of voters, and subtract one from the other to find the difference. We’d do that many times to build up the distribution of differences, and then compare the difference that we actually got with the distribution we found from many samples. If the difference that we actually got is very large compared to this distribution, then we can conclude that a simple accident of sampling is not the explanation of our difference.

But, we do not have lots of samples of 1315 voters to build up this distribution.

Happily, there is a trick we can use to get something very close to this distribution. That trick is to use the sample we have over and over again to make a simulated random sample. The idea is that, if there is no real age difference between Leave and Remain voters, then the Leave and Remain labels tell us nothing about age, and we can reassign them at random. We do this by pooling all 1315 voters into a single group, shuffling them randomly, and then splitting them into a fake group of 541 Leave voters and a fake group of 774 Remain voters, where these new fake groups have a random mixture of actual Leave voters and actual Remain voters. We calculate the mean age difference for these two fake groups, and then repeat the procedure many times. This gives us something very similar to the distribution we would expect if we really had taken many new samples of 1315.

To do this shuffling, we need More on working with lists.

To see this permutation test in action: Comparing two groups with permutation testing.
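The pool, shuffle and split procedure might be sketched like this. The ages below are invented purely for illustration - they are NOT the survey data, and the groups are much smaller than the real 541 and 774. The helper names `mean` and `permutation_differences` are also made up for this example.

```python
import random


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def permutation_differences(group_a, group_b, n_trials=10_000, seed=None):
    """Mean differences between fake groups made by shuffling the pooled values."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)  # pool everyone into a single group
    n_a = len(group_a)
    differences = []
    for _ in range(n_trials):
        rng.shuffle(pooled)  # shuffle the pooled values in place
        # Split into fake groups with the same sizes as the real ones.
        fake_a, fake_b = pooled[:n_a], pooled[n_a:]
        differences.append(mean(fake_a) - mean(fake_b))
    return differences


# Invented ages, for illustration only.
leave_ages = [61, 48, 70, 55, 66, 39, 58, 73, 52, 64]
remain_ages = [34, 45, 29, 51, 38, 62, 27, 44, 36, 49, 31, 57]

observed = mean(leave_ages) - mean(remain_ages)
differences = permutation_differences(leave_ages, remain_ages, n_trials=5000, seed=1)
# How often does shuffling give a difference at least as large as the real one?
n_as_large = sum(d >= observed for d in differences)
print("Observed difference in mean age:", observed)
print("Proportion of shuffles at least as large:", n_as_large / len(differences))
```

If only a tiny proportion of the shuffled differences are as large as the observed difference, an accident of sampling is not a good explanation for it.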

Paired tests with permutation

We can also do paired tests with permutation. A paired test applies when we have two measures from the same individual. We know that individuals differ from one another, but we might wonder whether the two measures differ. If they don’t differ, then it should not matter which measure is which, so we can build up our sampling distribution by swapping the two measures randomly, within each individual. For each trial, we do the random swaps, calculate the difference between the two measures for each individual, and then take the mean of these differences across individuals. That mean goes into our sampling distribution, and we do this many times to build up the whole distribution.

See paired tests in Attitudes to animal research.
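A sketch of the swapping idea, with invented scores rather than the real data. Swapping the two measures within an individual just flips the sign of that individual's difference, so the code flips signs at random. The function name `paired_permutation_means` is hypothetical.

```python
import random


def paired_permutation_means(first, second, n_trials=10_000, seed=None):
    """Mean paired differences after randomly swapping measures within individuals."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(first, second)]
    means = []
    for _ in range(n_trials):
        # Swapping the two measures for an individual turns (a - b) into (b - a),
        # so each difference keeps or flips its sign with a 50% chance.
        swapped = [d if rng.random() < 0.5 else -d for d in diffs]
        means.append(sum(swapped) / len(swapped))
    return means


# Invented scores for 8 individuals, measured twice. Illustration only.
measure_1 = [7, 5, 8, 6, 9, 4, 7, 8]
measure_2 = [5, 5, 6, 4, 8, 5, 5, 6]

observed = sum(a - b for a, b in zip(measure_1, measure_2)) / len(measure_1)
means = paired_permutation_means(measure_1, measure_2, n_trials=5000, seed=1)
# How often is the swapped mean at least as far from zero as the real one?
n_as_extreme = sum(abs(m) >= abs(observed) for m in means)
print("Observed mean difference:", observed)
print("Proportion of trials at least as extreme:", n_as_extreme / len(means))
```

Comparing absolute values here asks whether the measures differ in either direction. Comparing the raw values instead would ask about one direction only.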

Testing for straight-line relationships with permutation

Look at the data described at the top of Schooling and fertility. This dataset has measures of gender inequality across countries. There seems to be some sort of straight-line relationship in these measures between the average number of years a girl stays in school, and the number of children born to teenage mothers.

But - could this be due to sampling? If we randomly paired the average years in school with the number of children, would we see a relationship like this appear often, or rarely?

To think about what measure we would use for a straight-line relationship, see What order is best?.

Now look at the test of linear relationships at the end of Schooling and fertility.
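To give a flavor of the random pairing idea, here is a sketch that shuffles one variable against the other and recalculates a measure of straight-line relationship each time. It uses the Pearson correlation coefficient as that measure, which is an assumption for this example - the linked pages may build up a different measure. The numbers are invented, not taken from the dataset.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def permuted_correlations(xs, ys, n_trials=10_000, seed=None):
    """Correlations after randomly re-pairing the ys with the xs."""
    rng = random.Random(seed)
    shuffled = list(ys)  # copy, so the original pairing is left alone
    results = []
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        results.append(correlation(xs, shuffled))
    return results


# Invented values for 8 countries. Illustration only.
years_in_school = [4.1, 5.5, 6.8, 7.9, 9.2, 10.4, 11.1, 12.3]
teen_births = [120, 98, 105, 76, 60, 48, 35, 22]

observed = correlation(years_in_school, teen_births)
fake = permuted_correlations(years_in_school, teen_births, n_trials=5000, seed=1)
# How often does random pairing give a relationship at least as strong?
n_as_strong = sum(abs(r) >= abs(observed) for r in fake)
print("Observed correlation:", observed)
print("Proportion of random pairings at least as strong:", n_as_strong / len(fake))
```

If random pairing rarely produces a relationship as strong as the one we observed, sampling alone is an unlikely explanation.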

[1] Source files for building the slides, and the source for all the pages in this website, are always available in the course GitHub repository.