# Permutation and the t-test

On the idea of permutation page, we used permutation to compare two groups of numbers.

In our case, each number corresponded to one person in the study. The number for each subject was the number of mosquitoes flying towards them. The subjects were from two groups: people who had just drunk beer, and people who had just drunk water. There were 25 subjects who had drunk beer, and therefore, 25 numbers of mosquitoes corresponding to the “beer” group. There were 18 subjects who had drunk water, and 18 numbers corresponding to the “water” group.

Here we repeat the permutation test, as a reminder.

As before, you can download the data from `mosquito_beer.csv`.

See this page for more details on the dataset, and the data license page.

```
# Import Numpy library, rename as "np"
import numpy as np
# Import Pandas library, rename as "pd"
import pandas as pd
# Safe setting for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
# Set up plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
```

Read in the data, and get the numbers of mosquitoes flying towards the beer drinkers and towards the water drinkers, after they had drunk their beer or water. See the idea of permutation page.

```
# Read in the data, select beer and water values.
mosquitoes = pd.read_csv('mosquito_beer.csv')
after_rows = mosquitoes[mosquitoes['test'] == 'after']
beer_rows = after_rows[after_rows['group'] == 'beer']
beer_activated = np.array(beer_rows['activated'])
water_rows = after_rows[after_rows['group'] == 'water']
water_activated = np.array(water_rows['activated'])
```

There are 25 values in the beer group, and 18 in the water group:

```
print('Number in beer group:', len(beer_activated))
print('Number in water group:', len(water_activated))
```

```
Number in beer group: 25
Number in water group: 18
```

We are interested in the difference between the means of these numbers:

```
observed_difference = np.mean(beer_activated) - np.mean(water_activated)
observed_difference
```

```
4.433333333333334
```

In the permutation test we simulate an ideal (null) world in which there is no
average difference between the numbers in the two groups. We do this by
pooling the beer and water numbers, shuffling them, and then making fake beer
and water groups for which we know, from the shuffling, that the average
difference will, in the long run, be zero. By repeating this shuffle-and-sample
step many times, we build up the distribution of the average difference in the
null world. This is the *sampling distribution* of the mean difference:

```
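# Pool the beer and water values into one array.
pooled = np.append(beer_activated, water_activated)
n_iters = 10000
fake_differences = np.zeros(n_iters)
for i in np.arange(n_iters):
    # Shuffle, then split into a fake beer group (first 25) and a fake
    # water group (remaining 18).
    shuffled = np.random.permutation(pooled)
    fake_differences[i] = np.mean(shuffled[:25]) - np.mean(shuffled[25:])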
```

Here’s the histogram. This time we have given the plot a title, using the
`plt.title` function.

```
plt.hist(fake_differences)
plt.title('Sampling distribution of difference of means');
```

We can work out the proportion of the sampling distribution that is greater than or equal to the observed value. This gives an estimate of the probability of seeing a value at least as large as the observed value, if we are in fact in the null (ideal) world:

```
permutation_p = np.count_nonzero(
    fake_differences >= observed_difference) / n_iters
permutation_p
```

```
0.0559
```
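The shuffle-and-count steps above can be wrapped in a reusable function. This is a minimal sketch, not part of the original analysis: the function name `permutation_p_value`, its parameters, and the randomly generated example data are all hypothetical, and it uses Numpy's `default_rng` generator where the cells above use `np.random.permutation`.

```python
import numpy as np


def permutation_p_value(group_a, group_b, n_iters=10000, rng=None):
    """ One-tailed permutation p value for mean(group_a) - mean(group_b).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    n_a = len(group_a)
    observed = np.mean(group_a) - np.mean(group_b)
    pooled = np.concatenate([group_a, group_b])
    fake = np.zeros(n_iters)
    for i in range(n_iters):
        # Shuffle the pooled values, then split into two fake groups
        # with the same sizes as the real groups.
        shuffled = rng.permutation(pooled)
        fake[i] = np.mean(shuffled[:n_a]) - np.mean(shuffled[n_a:])
    # Proportion of fake differences at least as large as the observed one.
    return np.count_nonzero(fake >= observed) / n_iters


# Made-up example data (NOT the mosquito data), with the same group sizes.
rng = np.random.default_rng(42)
example_a = rng.normal(24, 4, size=25)
example_b = rng.normal(20, 4, size=18)
example_p = permutation_p_value(example_a, example_b, n_iters=2000, rng=rng)
print('Permutation p value for made-up data:', example_p)
```

Taking the group size from `len(group_a)` avoids hard-coding the 25 that appears in the loop above, so the same function works for groups of any size.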

Remember that the *standard deviation* is a measure of the spread of a
distribution. We go into the standard deviation in more detail later in the
course, but for now, we just use Numpy to
calculate the standard deviation.

```
sampling_sd = np.std(fake_differences)
sampling_sd
```

```
2.738939592577363
```

Here is the distribution of the `fake_differences` again, showing the mean plus
and minus one standard deviation. The standard deviation is a measure of how
spread out the distribution is, around its mean.

```
plt.hist(fake_differences)
fake_mean = np.mean(fake_differences)
# Red dots just above the x-axis at the mean +/- one standard deviation.
plt.plot([fake_mean - sampling_sd, fake_mean + sampling_sd], [50, 50], 'or')
plt.title('Sampling distribution +/- one standard deviation');
```

We can use the standard deviation as a unit of distance in the distribution.

One way to judge how extreme the observed value is, is to ask how many standard deviations it lies from the center of the distribution, which is zero.

```
like_t = observed_difference / sampling_sd
like_t
```

```
1.6186312926900053
```

Notice the variable name `like_t`. This number is rather like the famous t
statistic.

The difference between this `like_t` value and the *t statistic* is that the t
statistic is the observed difference divided by another *estimate* of the
standard deviation of the sampling distribution. Specifically it is an
estimate that relies on the assumption that the `beer_activated` and
`water_activated` numbers come from a simple bell-shaped normal
distribution.

The specific calculation relies on calculating the *prediction errors* when we
use the mean from each group as the prediction for the values in the group.

```
beer_errors = beer_activated - np.mean(beer_activated)
water_errors = water_activated - np.mean(water_activated)
all_errors = np.append(beer_errors, water_errors)
```

The estimate for the standard deviation of the sampling distribution follows this formula. The derivation of the formula is well outside the scope of the class.
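Written as mathematics, the calculation looks like this. The notation is chosen here for illustration: $e_i$ stands for the prediction errors (each value minus its own group mean), $n_1$ and $n_2$ are the numbers of values in the two groups, $\hat{\sigma}$ is the estimated standard deviation of the errors, and $\widehat{\mathrm{SD}}_{\text{sampling}}$ is the estimated standard deviation of the sampling distribution.

$$
\hat{\sigma} = \sqrt{\frac{\sum_i e_i^2}{n_1 + n_2 - 2}},
\qquad
\widehat{\mathrm{SD}}_{\text{sampling}} = \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
$$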

```
# The t-statistic estimate.
n1 = len(beer_activated)
n2 = len(water_activated)
est_error_sd = np.sqrt(np.sum(all_errors ** 2) / (n1 + n2 - 2))
sampling_sd_estimate = est_error_sd * np.sqrt(1 / n1 + 1 / n2)
sampling_sd_estimate
```

```
2.7028390172904366
```

Notice that this is rather similar to the estimate we got directly from the permutation distribution:

```
sampling_sd
```

```
2.738939592577363
```

The t statistic is the observed mean difference divided by the estimate of the standard deviation of the sampling distribution.

```
t_statistic = observed_difference / sampling_sd_estimate
t_statistic
```

```
1.640250605001883
```

This is the same t statistic value calculated by the *independent sample t
test* routine from Scipy:

```
from scipy.stats import ttest_ind
t_result = ttest_ind(beer_activated, water_activated)
t_result.statistic
```

```
1.6402506050018828
```
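To check that the agreement is not specific to one dataset, the by-hand calculation can be put in a function and compared with `ttest_ind` on other numbers. This is a sketch under stated assumptions: the function name `pooled_t_statistic` and the randomly generated example data are hypothetical. It relies on `ttest_ind` using its default `equal_var=True`, which is the pooled-variance version of the test that matches the formula above.

```python
import numpy as np
from scipy.stats import ttest_ind


def pooled_t_statistic(group_a, group_b):
    """ t statistic by hand, following the steps in the text.

    Hypothetical helper for illustration only.
    """
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(group_a), len(group_b)
    # Prediction errors, using each group's mean as the prediction.
    errors = np.concatenate([group_a - np.mean(group_a),
                             group_b - np.mean(group_b)])
    est_error_sd = np.sqrt(np.sum(errors ** 2) / (n_a + n_b - 2))
    sampling_sd_estimate = est_error_sd * np.sqrt(1 / n_a + 1 / n_b)
    return (np.mean(group_a) - np.mean(group_b)) / sampling_sd_estimate


# Made-up example data (NOT the mosquito data).
rng = np.random.default_rng(1)
example_a = rng.normal(24, 4, size=25)
example_b = rng.normal(20, 4, size=18)
by_hand = pooled_t_statistic(example_a, example_b)
from_scipy = ttest_ind(example_a, example_b).statistic
print('By hand:', by_hand)
print('From Scipy:', from_scipy)
```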

The equivalent probability from a t test is also outside the scope of the course, but, if the data we put into the t test is more or less compatible with a normal distribution, then the matching p value is similar to that of the permutation test.

```
# The "one-tailed" probability from the t-test.
t_result.pvalue / 2
```

```
0.054302080886695414
```

```
# The permutation p value is very similar.
permutation_p
```

```
0.0559
```
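Halving the two-tailed p value, as in the t-test cell above, gives the upper-tail probability only when the t statistic is positive. The sketch below, on made-up data, gets the one-tailed value directly from the t distribution with `n1 + n2 - 2` degrees of freedom, and compares it with the halving shortcut. The variable names and example data are hypothetical.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_ind

# Made-up example data (NOT the mosquito data).
rng = np.random.default_rng(2)
example_a = rng.normal(24, 4, size=25)
example_b = rng.normal(20, 4, size=18)

result = ttest_ind(example_a, example_b)
df = len(example_a) + len(example_b) - 2

# Probability of a t value at least as large as the one we saw.
one_tailed_direct = t_dist.sf(result.statistic, df)

# The halving shortcut, with the correction needed for a negative t.
if result.statistic > 0:
    one_tailed_halved = result.pvalue / 2
else:
    one_tailed_halved = 1 - result.pvalue / 2

print('Direct:', one_tailed_direct)
print('Halved:', one_tailed_halved)
```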

The permutation test is more general than the t test, because the t test relies on the assumption that the numbers come from a normal distribution, but the permutation test does not.

Of course, you should not believe these assertions without evidence, so
your next step is to use the *simulation* tools you have learned, to test the
t-test.
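As one possible starting point, here is a minimal sketch of such a simulation. It assumes a null world where both groups come from the same normal distribution, with a mean of 20 and a standard deviation of 4 chosen arbitrarily for illustration. If the t-test behaves as claimed, about 5% of the simulated p values should fall below 0.05.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n_sims = 2000

# Each row is one simulated study: 25 "beer" and 18 "water" values drawn
# from the SAME normal distribution, so the null world is true by design.
fake_a = rng.normal(20, 4, size=(n_sims, 25))
fake_b = rng.normal(20, 4, size=(n_sims, 18))

# Run one t-test per row.
p_values = ttest_ind(fake_a, fake_b, axis=1).pvalue

# Proportion of simulated studies with a (two-tailed) p value below 0.05.
false_positive_rate = np.mean(p_values < 0.05)
print('Proportion of p values below 0.05:', false_positive_rate)
```

A natural follow-up is to replace the normal draws with values from a clearly non-normal distribution, and see how far this proportion moves from 0.05.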