####################### Schooling and fertility ####################### .. code-links:: clear Here we are analyzing this dataset from the World Bank on gender and inequality: * https://data.worldbank.org/data-catalog/gender-statistics You can download the data yourself as a zip file `from that site `_, but to make your life a little easier, I've made a link to the extracted data file: * :download:`Gender_StatsData.csv` Download this file to the same directory as your Jupyter Notebook. .. nbplot:: >>> # Import Pandas with its usual short name >>> import pandas as pd Load the Comma Separated Value text file into Pandas as a data frame: .. nbplot:: >>> df = pd.read_csv('Gender_StatsData.csv') This is a slightly clumsy-looking data frame, because it has years for columns, and variables for rows, where there are 630 variables for each country. So there are 630 rows * the number of countries. To investigate, we first look at the column names: .. nbplot:: >>> df.columns Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', 'Unnamed: 62'], dtype='object') Next we look at the data frame itself: .. nbplot:: >>> df Country Name Country Code ... 0 Arab World ARB ... 1 Arab World ARB ... 2 Arab World ARB ... 3 Arab World ARB ... 4 Arab World ARB ... 5 Arab World ARB ... ... [165690 rows x 63 columns] There are lots of countries here, so to start, let's look at the variables for the UK. We get the UK country code from http://www.worldatlas.com/aatlas/ctycodes.htm. The code is ``GBR``. .. nbplot:: >>> # We select only the UK rows >>> gb = df[df['Country Code'] == 'GBR'] >>> gb Country Name Country Code ... 158130 United Kingdom GBR ... 158131 United Kingdom GBR ... 158132 United Kingdom GBR ... 158133 United Kingdom GBR ... 158134 United Kingdom GBR ... 158135 United Kingdom GBR ... ... [630 rows x 63 columns] Pandas truncates the output to only show a certain number of rows, and only a certain length for the text fields. To investigate further, you can increase these limits to see all 630 rows for the UK, and more of the text for the text fields: .. nbplot:: :hide-from: all :show-to: doctest >>> max_colwidth = pd.options.display.max_colwidth >>> max_rows = pd.options.display.max_rows .. nbplot:: >>> # See more of the text, more rows in the displayed output >>> pd.options.display.max_colwidth = 80 >>> pd.options.display.max_rows = 700 If you are working in the Notebook, you will now see all of the rows and the whole text field with the variable description. .. nbplot:: >>> # This will be different from above when working in the Notebook >>> gb Country Name Country Code ... 158130 United Kingdom GBR ... 158131 United Kingdom GBR ... 158132 United Kingdom GBR ... 158133 United Kingdom GBR ... 158134 United Kingdom GBR ... 158135 United Kingdom GBR ... ... [630 rows x 63 columns] After scanning through this output, we decide we want to look at these two variables: * ``SP.ADO.TFRT`` : Adolescent fertility rate (births per 1,000 women ages 15-19) * ``SE.SCH.LIFE.FE`` : Expected years of schooling, female Remember that, for each variable there is one row per country. We have to select the rows corresponding to this variable, for all countries. We do that for both variables we are interested in. .. nbplot:: >>> # Put the variables we want into their own data frames >>> fertility_rate = df[df['Indicator Code'] == 'SP.ADO.TFRT'] >>> age_female_sch = df[df['Indicator Code'] == 'SE.SCH.LIFE.FE'] For convenience, we are going to put the values from these columns back into their own data frame. To do that, we make a new data frame, and fill it with the values from our variables: .. nbplot:: >>> # Make a new data frame to store our school and fertility columns >>> school_fertility = pd.DataFrame(columns=['school', 'fertility']) >>> school_fertility Empty DataFrame Columns: [school, fertility] Index: [] We fill the empty data frame with the values we selected: .. nbplot:: >>> school_fertility['school'] = age_female_sch['2014'].values >>> school_fertility['fertility'] = fertility_rate['2014'].values >>> school_fertility school fertility 0 10.993670 48.479108 1 NaN 57.160052 2 16.449970 19.519840 3 11.816701 41.685074 4 13.469607 21.358688 5 13.257181 22.752998 ... We do not have the expected length of schooling for quite a few contries; these are the rows with ``NaN`` in the corresponding columns. "NaN" stands for Not a Number. We often call these *missing values*. They can also be called "NA" for "Not Applicable". There are some missing values for the fertility statistic too. For simplicity, we drop rows that have missing values in either of these two columns: .. nbplot:: >>> # Drop the rows with any missing (NaN) values >>> school_fertility = school_fertility.dropna() >>> school_fertility school fertility 0 10.993670 48.479108 2 16.449970 19.519840 3 11.816701 41.685074 4 13.469607 21.358688 5 13.257181 22.752998 6 13.280039 23.077289 ... Now we have the data we think we want, we do a quick check for odd values by doing a histogram of each column: .. mpl-interactive:: .. nbplot:: >>> # Import the plotting library with a memorable name >>> import matplotlib.pyplot as plt .. nbplot:: >>> # Histogram of the school length >>> plt.hist(school_fertility['school']); (...) .. nbplot:: >>> # Histogram of the fertility rate >>> plt.hist(school_fertility['fertility']); (...) Now to the part we are interested in. We speculate that countries that don't keep their women in school for long, may have higher adolescent fertility rates. We plot length of schooling on the X axis and fertility rates on the Y axis: .. nbplot:: >>> # Scatterplot of school length against fertility rate >>> plt.scatter(school_fertility['school'], school_fertility['fertility']) <...> We save the data frame we made to a file, so we can use these values again. When saving, we drop off the first implicit column, with the case numbers: .. nbplot:: >>> school_fertility.to_csv('school_fertility.csv', index=False) .. nbplot:: :hide-from: all :show-to: doctest >>> # Reset to Pandas defaults >>> pd.options.display.max_colwidth = max_colwidth >>> pd.options.display.max_rows = max_rows As usual, we make lists for our two sets of numbers: .. nbplot:: >>> school = list(school_fertility['school']) >>> fertility = list(school_fertility['fertility']) We have speculated that, as the number of years in school goes up, the number of adolescents having children goes down. Remember :doc:`what_order_is_best`? If ``school`` and ``fertility`` are perfectly arranged, with low values for ``school`` going with high values of ``fertility``, we will get a high ``list_product``: .. nbplot:: >>> def list_product(first_list, second_list): ... product = 0 ... for i in range(len(first_list)): ... value = first_list[i] * second_list[i] ... product = product + value ... return product So, what do we get for our observed lists? .. nbplot:: >>> observed_product = list_product(school, fertility) What do we compare this to? What is our sampling distribution? We need our null hypothesis again. Our null hypothesis is that there is no systematic shared ordering to ``school`` and ``fertility``. Any appearance of shared ordering is just because of random sampling variation in ``school`` and ``fertility``. We can get an idea of sampling variation by doing something very similar to what we did in :doc:`brexit_ages`. We can permute the ``fertility`` list to give a random sample of fertility values. So, here is one trial: .. nbplot:: >>> #: The random module >>> import random .. nbplot:: :hide-from: all :show-to: doctest >>> random.seed(1966) .. nbplot:: >>> def one_product(list_1, list_2): ... # We don't want to change list_2, so make a copy ... new_list_2 = list_2.copy() ... random.shuffle(new_list_2) ... return list_product(list_1, new_list_2) .. nbplot:: >>> one_product(school, fertility) 75196.8979576753... .. nbplot:: >>> one_product(school, fertility) 72668.6855322269... Now let's build up the sampling distribution: .. nbplot:: >>> n_repeats = 10000 >>> sample_products = [] >>> for i in range(n_repeats): ... product = one_product(school, fertility) ... sample_products.append(product) .. nbplot:: >>> plt.hist(sample_products) (...)