Missing values

# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
pd.set_option('mode.chained_assignment', 'raise')

If you are running on your laptop, you should download the gender_stats.csv file to the same directory as this notebook.

See the gender statistics description page for more detail on the dataset.

# Load the data file
gender_data = pd.read_csv('gender_stats.csv')
gender_data.head()
   country_name  country_code  fert_rate  gdp_us_billion  health_exp_per_cap  health_exp_pub  prim_ed_girls  mat_mort_ratio  population
0         Aruba           ABW    1.66325             NaN                 NaN             NaN      48.721939             NaN    0.103744
1   Afghanistan           AFG    4.95450       19.961015          161.138034        2.834598      40.109708          444.00   32.715838
2        Angola           AGO    6.12300      111.936542          254.747970        2.447546            NaN          501.25   26.937545
3       Albania           ALB    1.76925       12.327586          574.202694        2.836021      47.201082           29.25    2.888280
4       Andorra           AND        NaN        3.197538         4421.224933        7.260281      47.123345             NaN    0.079547
# Get the GDP values as a Pandas Series
gdp = gender_data['gdp_us_billion']
gdp.head()
0           NaN
1     19.961015
2    111.936542
3     12.327586
4      3.197538
Name: gdp_us_billion, dtype: float64

Missing values and NaN

Looking at the values of gdp (and therefore, the values of the gdp_us_billion column of gender_data), we see that some of the values are NaN, which means Not a Number. Pandas uses this marker to indicate values that are not available, or missing data.
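
As a minimal sketch of how NaN behaves, the cell below builds a small made-up Series rather than using the real data. The name example and its values are invented for illustration. It shows that NaN is a floating point value, and that NaN is not equal to anything, itself included, which is why Pandas provides the isna method to test for it.

```python
import numpy as np
import pandas as pd

# A small made-up Series; these values are illustrative only,
# they are not taken from gender_stats.csv.
example = pd.Series([np.nan, 20.0, 112.0])

# NaN is a floating point value, so the Series has a float dtype.
print(example.dtype)

# NaN is not equal to anything, including itself.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# Use isna to find the missing values instead of comparing with ==.
is_missing = example.isna()
print(list(is_missing))
```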

Numpy does not like to calculate with NaN values. Here is Numpy trying to calculate the median of the gdp values.

np.median(gdp)
nan

Depending on your version of Numpy, you may also see a warning about an invalid value.

Numpy recognizes that one or more values are NaN and refuses to guess what to do when calculating the median, so it returns NaN as the result.
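
If you do want a median that ignores the missing values, one possible approach is sketched below. It uses a small made-up Series (the name values and its contents are assumptions for illustration, not the real GDP data). Numpy's nanmedian function and the median method of a Pandas Series both skip NaN values, whereas np.median does not.

```python
import numpy as np
import pandas as pd

# A small made-up Series with one missing value (illustrative only).
values = pd.Series([np.nan, 1.0, 3.0, 5.0])
array = values.to_numpy()

# np.median gives NaN if any value is NaN.
plain_median = np.median(array)
print(plain_median)

# np.nanmedian drops the NaN values first, then takes the median.
nan_median = np.nanmedian(array)
print(nan_median)

# The Series median method also skips NaN values by default.
series_median = values.median()
print(series_median)
```

Here the two NaN-aware calls agree, because both calculate the median of the three values that are present.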

The gender_data data frame has 216 rows. We can use the general Python len function to see how many elements there are in gdp.

len(gdp)
216

As expected, it has the same number of elements as there are rows in gender_data.

The count method of the series gives the number of values that are not missing, that is, the number of values that are not NaN.

gdp.count()
200
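
The difference between len and count is the number of missing values; for gdp that is 216 - 200 = 16. As a minimal sketch of the same arithmetic, the cell below uses a small made-up Series (the name example and its values are invented for illustration) and checks the result against a direct count of the NaN values using isna.

```python
import numpy as np
import pandas as pd

# A small made-up Series with two missing values (illustrative only).
example = pd.Series([np.nan, 20.0, 112.0, 12.0, np.nan])

# len counts every element, missing or not.
n_total = len(example)

# count gives the number of values that are not NaN.
n_present = example.count()

# isna marks each missing value as True; summing counts the True values.
n_missing = example.isna().sum()

print(n_total, n_present, n_missing)
```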