
Schooling and fertility

Here we are analyzing this dataset from the World Bank on gender and inequality:

You can download the data yourself as a zip file from that site, but to make your life a little easier, I’ve made a link to the extracted data file:

Download this file to the same directory as your Jupyter Notebook.

>>> # Import Pandas with its usual short name
>>> import pandas as pd

Load the Comma Separated Value text file into Pandas as a data frame:

>>> df = pd.read_csv('Gender_StatsData.csv')

This is a slightly clumsy-looking data frame, because it has years for columns and variables for rows, with 630 variables for each country. The total number of rows is therefore 630 times the number of countries. To investigate, we first look at the column names:

>>> df.columns
Index(['"Country Name"', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', 'Unnamed: 62'],
      dtype='object')
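Notice the final `Unnamed: 62` column. This is almost certainly the result of a trailing comma at the end of each line in the CSV file, which `read_csv` reads as an extra, empty field. If it bothers you, you can drop it; here is a minimal sketch on a one-row stand-in frame (the real analysis below does not need this step):

```python
import pandas as pd

# A one-row stand-in for the real frame; the 'Unnamed: 62' column most
# likely comes from a trailing comma on each line of the CSV file
df = pd.DataFrame({'Country Code': ['ARB'],
                   '2017': [1.0],
                   'Unnamed: 62': [float('nan')]})
# drop(columns=...) removes the empty trailing column
df = df.drop(columns=['Unnamed: 62'])
assert list(df.columns) == ['Country Code', '2017']
```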

Next we look at the data frame itself:

>>> df
       "Country Name" Country Code  \
0           Arab World          ARB ...
1           Arab World          ARB ...
2           Arab World          ARB ...
3           Arab World          ARB ...
4           Arab World          ARB ...
5           Arab World          ARB ...
...

[165690 rows x 63 columns]
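As a quick sanity check on the rows-times-countries arithmetic (this check is my addition, not part of the original analysis):

```python
# 165690 total rows at 630 indicator rows per country gives the number
# of distinct country (and region) codes in the file
n_codes = 165690 // 630
assert n_codes == 263
# On the real data frame, df['Country Code'].nunique() should agree
```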

There are lots of countries here, so to start, let’s look at the variables for the UK.

We get the UK country code from http://www.worldatlas.com/aatlas/ctycodes.htm. The code is GBR.

>>> # We select only the UK rows
>>> gb = df[df['Country Code'] == 'GBR']
>>> gb
       "Country Name" Country Code  \
158130  United Kingdom          GBR ...
158131  United Kingdom          GBR ...
158132  United Kingdom          GBR ...
158133  United Kingdom          GBR ...
158134  United Kingdom          GBR ...
158135  United Kingdom          GBR ...
...

[630 rows x 63 columns]

Pandas truncates the output to show only a certain number of rows, and only a certain length for the text fields. To investigate further, you can raise these limits to see all 630 rows for the UK, and more of the text in the text fields:

>>> # See more of the text, more rows in the displayed output
>>> pd.options.display.max_colwidth = 80
>>> pd.options.display.max_rows = 700

If you are working in the Notebook, you will now see all of the rows and the whole text field with the variable description.

>>> # This will be different from above when working in the Notebook
>>> gb
       "Country Name" Country Code  \
158130  United Kingdom          GBR ...
158131  United Kingdom          GBR ...
158132  United Kingdom          GBR ...
158133  United Kingdom          GBR ...
158134  United Kingdom          GBR ...
158135  United Kingdom          GBR ...
...

[630 rows x 63 columns]
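A side note: the settings above change the display options for the rest of the session. If you only want the expanded display once, pandas also offers `pd.option_context`, which restores the defaults when the block exits. A minimal sketch on a throwaway frame:

```python
import pandas as pd

# A throwaway frame with more rows than the default display limit
df_demo = pd.DataFrame({'x': range(100)})
# Inside the with-block the option applies; on exit the default returns
with pd.option_context('display.max_rows', 200):
    full_text = repr(df_demo)
# The last row (label 99) is present, so nothing was elided
assert '99' in full_text
```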

After scanning through this output, we decide we want to look at these two variables:

  • SP.ADO.TFRT : Adolescent fertility rate (births per 1,000 women ages 15-19)
  • SE.SCH.LIFE.FE : Expected years of schooling, female

Remember that, for each variable, there is one row per country. We have to select the rows corresponding to each variable, across all countries. We do that for both variables we are interested in.

>>> # Put the variables we want into their own data frames
>>> fertility_rate = df[df['Indicator Code'] == 'SP.ADO.TFRT']
>>> age_female_sch = df[df['Indicator Code'] == 'SE.SCH.LIFE.FE']
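If you prefer, you can pull out both sets of rows in one go with `isin`. A small sketch on a synthetic frame (the indicator codes are the real ones, but the values are made up):

```python
import pandas as pd

# Synthetic frame with the two indicator codes of interest plus a decoy
df = pd.DataFrame({
    'Indicator Code': ['SP.ADO.TFRT', 'SE.SCH.LIFE.FE', 'XX.OTHER'],
    '2014': [48.5, 11.0, 3.0],
})
# isin selects rows whose code matches any value in the list
both = df[df['Indicator Code'].isin(['SP.ADO.TFRT', 'SE.SCH.LIFE.FE'])]
assert len(both) == 2
```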

For convenience, we are going to put the 2014 values from these two data frames into a single new data frame. To do that, we make an empty data frame, then fill it with the values from our variables:

>>> # Make a new data frame to store our school and fertility columns
>>> school_fertility = pd.DataFrame(columns=['school', 'fertility'])
>>> school_fertility
Empty DataFrame
Columns: [school, fertility]
Index: []

We fill the empty data frame with the values we selected:

>>> school_fertility['school'] = age_female_sch['2014'].values
>>> school_fertility['fertility'] = fertility_rate['2014'].values
>>> school_fertility
        school   fertility
0    10.993670   48.479108
1          NaN   57.160052
2    16.449970   19.519840
3    11.816701   41.685074
4    13.469607   21.358688
5    13.257181   22.752998
...
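Filling with `.values` relies on both frames listing the countries in the same order. That holds here, because both are row selections from the same `df`, but in general pairing by position is fragile. A safer alternative (my suggestion, sketched on made-up values) is to merge on the country code, which pairs the rows explicitly:

```python
import pandas as pd

# Synthetic stand-ins for the two indicator frames; values are made up,
# and the frames deliberately list the countries in different orders
age_female_sch = pd.DataFrame({'Country Code': ['GBR', 'ARB'],
                               '2014': [16.4, 11.0]})
fertility_rate = pd.DataFrame({'Country Code': ['ARB', 'GBR'],
                               '2014': [48.5, 14.5]})
# merge pairs rows by Country Code, so row order no longer matters
merged = age_female_sch.merge(fertility_rate, on='Country Code',
                              suffixes=('_school', '_fertility'))
school_fertility = merged.rename(columns={'2014_school': 'school',
                                          '2014_fertility': 'fertility'})
# GBR's schooling value now sits next to GBR's fertility value
gbr = school_fertility[school_fertility['Country Code'] == 'GBR']
assert gbr['school'].iloc[0] == 16.4
assert gbr['fertility'].iloc[0] == 14.5
```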

We do not have the expected length of schooling for quite a few countries; these are the rows with NaN in the corresponding column. “NaN” stands for Not a Number. We often call these missing values; they can also be called “NA”, for “Not Applicable”. There are some missing values for the fertility statistic too. For simplicity, we drop rows that have missing values in either of these two columns:

>>> # Drop the rows with any missing (NaN) values
>>> school_fertility = school_fertility.dropna()
>>> school_fertility
        school   fertility
0    10.993670   48.479108
2    16.449970   19.519840
3    11.816701   41.685074
4    13.469607   21.358688
5    13.257181   22.752998
6    13.280039   23.077289
...
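If you want to see how much data you are about to throw away before dropping it, `isna().sum()` counts the missing values per column. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values with one NaN in each column
school_fertility = pd.DataFrame({
    'school': [10.9, np.nan, 16.4],
    'fertility': [48.5, 57.2, np.nan],
})
# isna().sum() counts missing values per column
n_missing = school_fertility.isna().sum()
assert n_missing['school'] == 1
assert n_missing['fertility'] == 1
# dropna with no arguments drops a row if *any* column is NaN
assert len(school_fertility.dropna()) == 1
```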

Now that we have the data we think we want, we do a quick check for odd values by plotting a histogram of each column:

Hint

If running in the IPython console, consider running %matplotlib to enable interactive plots. If running in the Jupyter Notebook, use %matplotlib inline.

>>> # Import the plotting library with a memorable name
>>> import matplotlib.pyplot as plt
>>> # Histogram of the school length
>>> plt.hist(school_fertility['school']);
(...)

>>> # Histogram of the fertility rate
>>> plt.hist(school_fertility['fertility']);
(...)


Now to the part we are interested in. We speculate that countries that do not keep their women in school for long may have higher adolescent fertility rates. We plot length of schooling on the X axis and fertility rate on the Y axis:

>>> # Scatterplot of school length against fertility rate
>>> plt.scatter(school_fertility['school'], school_fertility['fertility'])
<...>


We save the data frame we made to a file, so we can use these values again. When saving, we drop the implicit first column containing the row labels, by passing index=False:

>>> school_fertility.to_csv('school_fertility.csv', index=False)
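To see exactly what index=False does to the output, you can write to an in-memory buffer instead of a file (a sketch of mine, not part of the original analysis):

```python
import io

import pandas as pd

# A one-row stand-in for the real frame, with made-up values
school_fertility = pd.DataFrame({'school': [10.99], 'fertility': [48.48]})
# Write to an in-memory text buffer rather than a file on disk
buf = io.StringIO()
school_fertility.to_csv(buf, index=False)
csv_text = buf.getvalue()
# With index=False, the row labels do not appear in the output:
# the header line contains only the two column names
assert csv_text.splitlines()[0] == 'school,fertility'
```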

As usual, we make lists for our two sets of numbers:

>>> school = list(school_fertility['school'])
>>> fertility = list(school_fertility['fertility'])

We have speculated that, as the number of years in school goes up, the number of adolescents having children goes down.

Remember “What order is best?”. If school and fertility are perfectly arranged, with low values for school going with high values of fertility, we will get a low list_product:

>>> def list_product(first_list, second_list):
...     product = 0
...     for i in range(len(first_list)):
...         value = first_list[i] * second_list[i]
...         product = product + value
...     return product
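As an aside, the loop above is equivalent to this more compact form, which makes it clearer that list_product is the dot product of the two lists:

```python
# Sum of elementwise products, written with zip and a generator
# expression; this is the dot product of the two lists
def list_product(first_list, second_list):
    return sum(a * b for a, b in zip(first_list, second_list))

# 1*4 + 2*5 + 3*6 == 32
assert list_product([1, 2, 3], [4, 5, 6]) == 32
```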

So, what do we get for our observed lists?

>>> observed_product = list_product(school, fertility)

What do we compare this to? What is our sampling distribution?

We need our null hypothesis again.

Our null hypothesis is that there is no systematic shared ordering to school and fertility. Any appearance of shared ordering is just because of random sampling variation in school and fertility.

We can get an idea of sampling variation by doing something very similar to what we did in Comparing two groups with permutation testing. We can permute the fertility list to give a random sample of fertility values. So, here is one trial:

>>> #: The random module
>>> import random
>>> def one_product(list_1, list_2):
...     # We don't want to change list_2, so make a copy
...     new_list_2 = list_2.copy()
...     random.shuffle(new_list_2)
...     return list_product(list_1, new_list_2)
>>> one_product(school, fertility)
75196.897957675363
>>> one_product(school, fertility)
72668.685532226926

Now let’s build up the sampling distribution:

>>> n_repeats = 10000
>>> sample_products = []
>>> for i in range(n_repeats):
...     product = one_product(school, fertility)
...     sample_products.append(product)
>>> plt.hist(sample_products)
(...)

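A natural last step (my addition, sketched on toy numbers rather than the real lists) is to compare the observed product to the sampling distribution. Because a negative relationship gives a low product, the proportion of shuffled products at or below the observed value estimates how often ordering this strong would arise under the null hypothesis:

```python
import random

def list_product(first_list, second_list):
    # Sum of elementwise products, as in the function defined above
    return sum(a * b for a, b in zip(first_list, second_list))

def one_product(list_1, list_2):
    # Shuffle a copy of list_2, then take the product with list_1
    new_list_2 = list_2.copy()
    random.shuffle(new_list_2)
    return list_product(list_1, new_list_2)

# Toy stand-ins for the real lists: low school values paired with
# high fertility values, i.e. a strong negative relationship
school = [8, 10, 12, 14, 16]
fertility = [90, 70, 50, 30, 10]
observed_product = list_product(school, fertility)

random.seed(42)  # fixed seed so the run is reproducible
sample_products = [one_product(school, fertility) for i in range(10000)]
# Proportion of shuffled products at or below the observed product
p = sum(s <= observed_product for s in sample_products) / 10000
assert p < 0.05  # the observed ordering is rare under the null
```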