4.1 Introduction to data frames

Download notebook Interact

Introduction to data frames

Pandas is a Python package that implements data frames, and functions that operate on data frames.

# Load the Pands data science library, call it 'pd'
import pandas as pd

We will also use the usual Numpy array library:

# Load the Numpy array library, call it 'np'
import numpy as np

Data frames and series

We start by loading data from a Comma Separated Value file (CSV file). If you are running on your laptop, you should download the gender_stats.csv file to the same directory as this notebook.

See the gender statistics description page for more detail on the dataset.

# Load the data file
gender_data = pd.read_csv('gender_stats.csv')

This is our usual assignment statement. The LHS is gender_data, the variable name. The RHS is an expression, that returns a value.

What type of value does it return?

type(gender_data)

pandas.core.frame.DataFrame

Pandas integrates with the Notebook, so, if you display a data frame in the notebook, it does a nice display of rows and columns.

gender_data

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.663250	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.954500	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.123000	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.769250	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547
5	United Arab Emirates	ARE	1.793000	375.027082	2202.407569	2.581168	48.789260	6.00	9.080299
6	Argentina	ARG	2.328000	550.980968	1148.256142	2.782216	48.915810	53.75	42.976675
7	Armenia	ARM	1.545500	10.885362	348.663884	1.916016	46.782180	27.25	2.904683
8	American Samoa	ASM	NaN	0.640500	NaN	NaN	NaN	NaN	0.055422
9	Antigua and Barbuda	ATG	2.082000	1.298213	1152.493656	3.676514	48.291463	NaN	0.098872
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
11	Austria	AUT	1.455000	407.494276	4930.298893	8.504276	48.556078	4.00	8.566294
12	Azerbaijan	AZE	1.980000	62.003001	956.709718	1.197249	46.157363	25.25	9.531856
13	Burundi	BDI	5.992250	2.876978	59.693830	4.433749	50.481971	747.25	9.907015
14	Belgium	BEL	1.755000	494.221836	4297.838005	8.221003	48.864675	7.00	11.228495
15	Benin	BEN	4.806750	8.778151	83.726190	2.206916	47.211127	417.50	10.293715
16	Burkina Faso	BFA	5.564500	11.753054	87.143009	3.021246	48.217520	384.25	17.597395
17	Bangladesh	BGD	2.193250	174.545099	85.968844	0.860447	50.460564	194.75	159.371214
18	Bulgaria	BGR	1.510000	53.797612	1266.857003	4.243437	48.310802	10.75	7.220151
19	Bahrain	BHR	2.065250	32.004011	2030.158316	2.976386	49.116838	15.25	1.349810
20	Bahamas, The	BHS	1.877250	8.688000	1727.128385	3.308626	NaN	81.50	0.381904
21	Bosnia and Herzegovina	BIH	1.267000	17.323327	941.504655	6.841021	48.634905	11.75	3.574396
22	Belarus	BLR	1.677000	64.782942	986.236757	3.876601	48.685741	4.00	9.480348
23	Belize	BLZ	2.594750	1.680325	471.967465	3.744844	48.317238	29.25	0.351764
24	Bermuda	BMU	1.617500	5.555624	NaN	NaN	48.423588	NaN	0.065101
25	Bolivia	BOL	2.995250	31.509324	381.007594	4.192031	48.464175	218.25	10.562803
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
27	Barbados	BRB	1.792250	4.413080	1062.840088	4.828680	48.878181	28.00	0.283338
28	Brunei Darussalam	BRN	1.884000	15.719223	1795.924160	2.335194	48.523699	23.75	0.411581
29	Bhutan	BTN	2.061250	1.975145	277.526670	2.706908	49.572296	161.75	0.775905
...	...	...	...	...	...	...	...	...	...
186	Turks and Caicos Islands	TCA	NaN	NaN	NaN	NaN	48.846884	NaN	0.033703
187	Chad	TCD	6.211000	11.945942	69.904561	1.725194	42.929245	892.25	13.574024
188	Togo	TGO	4.620000	4.183610	71.263825	2.037809	48.270471	380.75	7.230904
189	Thailand	THA	1.516750	406.136904	581.927487	3.183842	48.213034	21.00	68.384986
190	Tajikistan	TJK	3.495750	8.036228	169.745970	1.976367	48.260680	33.25	8.363844
191	Turkmenistan	TKM	2.313750	37.973096	288.572644	1.349303	48.906879	43.50	5.465637
192	Timor-Leste	TLS	5.797750	1.361430	98.577296	1.140440	48.337367	240.25	1.212718
193	Tonga	TON	3.745750	0.439179	250.962504	3.987285	47.697931	129.25	0.105909
194	Trinidad and Tobago	TTO	1.782750	24.570947	1778.148073	3.071370	NaN	63.25	1.353877
195	Tunisia	TUN	2.140000	44.824374	782.950522	4.118771	48.142132	63.25	11.144409
196	Turkey	TUR	2.078000	895.175577	997.374772	4.189521	48.789477	17.50	77.034345
197	Tuvalu	TUV	NaN	0.036470	563.500592	15.506929	47.472414	NaN	0.010910
198	Tanzania	TZA	5.181250	44.935542	131.704162	2.648609	50.666580	429.50	52.281324
199	Uganda	UGA	5.822500	25.941461	132.892684	2.014349	50.099485	366.50	38.865339
200	Ukraine	UKR	1.510250	135.379275	628.579254	3.960185	48.984198	24.25	45.302704
201	Uruguay	URY	2.027000	54.345132	1721.507752	6.044403	48.295555	15.50	3.419977
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
203	Uzbekistan	UZB	2.372750	61.340649	334.476754	3.118842	48.387434	37.00	30.784500
204	St. Vincent and the Grenadines	VCT	1.986000	0.730107	775.803386	4.365757	48.536415	45.75	0.109421
205	Venezuela, RB	VEN	2.378250	376.146268	896.815314	1.587088	48.400934	97.00	30.734524
206	British Virgin Islands	VGB	NaN	NaN	NaN	NaN	47.581520	NaN	0.029585
207	Virgin Islands (U.S.)	VIR	1.760000	3.812000	NaN	NaN	NaN	NaN	0.104141
208	Vietnam	VNM	1.959500	181.820736	368.374550	3.779501	48.021053	54.75	90.742400
209	Vanuatu	VUT	3.364750	0.782876	125.568712	3.689874	47.301617	82.50	0.258896
210	Samoa	WSM	4.118250	0.799887	366.353096	5.697059	48.350049	54.75	0.192225
211	Kosovo	XKX	2.142500	6.804620	NaN	NaN	NaN	NaN	1.813820
212	Yemen, Rep.	YEM	4.225750	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.375250	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.394250	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.943000	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

216 rows × 9 columns

The data frame has rows and columns. Like other Python objects, it has attributes. These are pieces of data associated with the data frame. You have already seen methods, which are functions associated with the data frame. You can access attributes in the same way as you access methods, by typing the variable name, followed by a dot ., followed by the attribute name.

For example, one attribute of the data frame, is the shape:

gender_data.shape

(216, 9)

Another attribute is columns. This has the column names. For example, here is a good way of quickly seeing the column names for a data frame:

gender_data.columns

Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

You need more information about what these column names refer to. Here are the longer descriptions from the original data source (link above):

fert_rate: Fertility rate, total (births per woman).
gdp_us_billion: GDP (in current US \$ billions).
health_exp_per_cap: Health expenditure per capita, PPP (constant 2011 international \$).
health_exp_pub: Health expenditure, public (% of GDP).
prim_ed_girls: Primary education, pupils (% female).
mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).
population: Population, total.

You have just seen array slicing (in Selecting with arrays. You remember that array slicing uses square brackets. Data frames also allow slicing. For example, we often want to get all the data for a single column of the data frame. To do this, we use the same square bracket notation as we use for array slicing, with the name of the column inside the square brackets.

gdp = gender_data['gdp_us_billion']

What type of thing is this column of data?

type(gdp)

pandas.core.series.Series

Here are the values for gdp. You will notice that these are the same values you saw in the “gdp_us_billion” column when you displayed the whole data frame.

gdp

             NaN
       19.961015
      111.936542
       12.327586
        3.197538
      375.027082
      550.980968
       10.885362
        0.640500
        1.298213
    1422.994116
     407.494276
      62.003001
       2.876978
     494.221836
       8.778151
      11.753054
     174.545099
      53.797612
      32.004011
       8.688000
      17.323327
      64.782942
       1.680325
       5.555624
      31.509324
    2198.765606
       4.413080
      15.719223
       1.975145
           ...     
           NaN
     11.945942
      4.183610
    406.136904
      8.036228
     37.973096
      1.361430
      0.439179
     24.570947
     44.824374
    895.175577
      0.036470
     44.935542
     25.941461
    135.379275
     54.345132
  17369.124600
     61.340649
      0.730107
    376.146268
           NaN
      3.812000
    181.820736
      0.782876
      0.799887
      6.804620
     36.819337
    345.209888
     24.280990
     15.495514
Name: gdp_us_billion, Length: 216, dtype: float64

Missing values and `NaN`

Looking at the values of gdp (and therefore, the values of the gdp_us_billion column of gender_data, we see that some of the values are NaN, which means Not a Number. Pandas uses this marker to indicate values that are not available, or missing data.

Numpy does not like to calculate with NaN values. Here is Numpy trying to calculate the median of the gdp values.

np.median(gdp)

/Users/mb312/Library/Python/3.7/lib/python/site-packages/numpy/lib/function_base.py:3250: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)

nan

Notice the warning about an invalid value.

Numpy recognizes that one or more values are NaN and refuses to guess what to do, when calculating the median.

You saw from the shape above that gender_data has 263 rows. We can use the general Python len function, to see how many elements there are in gdp.

len(gdp)

As expected, it has the same number of elements as there are rows in gender_data.

The count method of the series gives the number of values that are not missing - that is - not NaN.

gdp.count()

Plotting with methods

We start with the magic incantation to load the plotting library.

# Load the library for plotting, name it 'plt'
import matplotlib.pyplot as plt
# Display plots inside the notebook.
%matplotlib inline
# Make plots look a little more fancy
plt.style.use('fivethirtyeight')

The gdp variable is a sequence of values, so we can do a histogram on these values, as we have done histograms on arrays.

plt.hist(gdp);

/Users/mb312/Library/Python/3.7/lib/python/site-packages/numpy/lib/histograms.py:754: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
/Users/mb312/Library/Python/3.7/lib/python/site-packages/numpy/lib/histograms.py:755: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)

png

Notice the multiple warnings as Matplotlib tried to calculate the bin widths for the histogram. These warnings result from Matplotlib struggline with NaN values.

Another way to do the histogram, is to use the hist method of the series.

A method is a function attached to a value. In this case hist is a function attached to a value of type Series.

Using the hist method instead of the plt.hist function can make the code a bit easier to read. The method also has the advantage that it discards the NaN values, by default, so it does not generate the same warnings.

gdp.hist();

png

Now we have had a look at the GDP values, we will look at the values for the mat_mort_ratio column. These are the numbers of women who die in childbirth for every 100,000 births.

mmr = gender_data['mat_mort_ratio']
mmr

       NaN
    444.00
    501.25
     29.25
       NaN
      6.00
     53.75
     27.25
       NaN
       NaN
     6.00
     4.00
    25.25
   747.25
     7.00
   417.50
   384.25
   194.75
    10.75
    15.25
    81.50
    11.75
     4.00
    29.25
      NaN
   218.25
    49.50
    28.00
    23.75
   161.75
        ...  
     NaN
  892.25
  380.75
   21.00
   33.25
   43.50
  240.25
  129.25
   63.25
   63.25
   17.50
     NaN
  429.50
  366.50
   24.25
   15.50
   14.00
   37.00
   45.75
   97.00
     NaN
     NaN
   54.75
   82.50
   54.75
     NaN
  399.75
  143.75
  233.75
  398.00
Name: mat_mort_ratio, Length: 216, dtype: float64

mmr.hist();

png

We are interested in the relationship of gpp and mmr. Maybe richer countries have better health care, and fewer maternal deaths.

Here is a plot, using the standard Matplotlib scatter function.

plt.scatter(gdp, mmr);

png

We can do the same plot using the plot.scatter method on the data frame. In that case, we specify the column names that should go on the x and the y axes.

gender_data.plot.scatter('gdp_us_billion', 'mat_mort_ratio');

png

An advantage of doing it this way is that we get the column names on the x and y axes by default.

Showing the top 5 values with the `head` method

We have already seen that Pandas will display the data frame with nice formatting. If the data frame is long, it will display only the first few and the last few rows:

gender_data

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.663250	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.954500	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.123000	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.769250	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547
5	United Arab Emirates	ARE	1.793000	375.027082	2202.407569	2.581168	48.789260	6.00	9.080299
6	Argentina	ARG	2.328000	550.980968	1148.256142	2.782216	48.915810	53.75	42.976675
7	Armenia	ARM	1.545500	10.885362	348.663884	1.916016	46.782180	27.25	2.904683
8	American Samoa	ASM	NaN	0.640500	NaN	NaN	NaN	NaN	0.055422
9	Antigua and Barbuda	ATG	2.082000	1.298213	1152.493656	3.676514	48.291463	NaN	0.098872
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
11	Austria	AUT	1.455000	407.494276	4930.298893	8.504276	48.556078	4.00	8.566294
12	Azerbaijan	AZE	1.980000	62.003001	956.709718	1.197249	46.157363	25.25	9.531856
13	Burundi	BDI	5.992250	2.876978	59.693830	4.433749	50.481971	747.25	9.907015
14	Belgium	BEL	1.755000	494.221836	4297.838005	8.221003	48.864675	7.00	11.228495
15	Benin	BEN	4.806750	8.778151	83.726190	2.206916	47.211127	417.50	10.293715
16	Burkina Faso	BFA	5.564500	11.753054	87.143009	3.021246	48.217520	384.25	17.597395
17	Bangladesh	BGD	2.193250	174.545099	85.968844	0.860447	50.460564	194.75	159.371214
18	Bulgaria	BGR	1.510000	53.797612	1266.857003	4.243437	48.310802	10.75	7.220151
19	Bahrain	BHR	2.065250	32.004011	2030.158316	2.976386	49.116838	15.25	1.349810
20	Bahamas, The	BHS	1.877250	8.688000	1727.128385	3.308626	NaN	81.50	0.381904
21	Bosnia and Herzegovina	BIH	1.267000	17.323327	941.504655	6.841021	48.634905	11.75	3.574396
22	Belarus	BLR	1.677000	64.782942	986.236757	3.876601	48.685741	4.00	9.480348
23	Belize	BLZ	2.594750	1.680325	471.967465	3.744844	48.317238	29.25	0.351764
24	Bermuda	BMU	1.617500	5.555624	NaN	NaN	48.423588	NaN	0.065101
25	Bolivia	BOL	2.995250	31.509324	381.007594	4.192031	48.464175	218.25	10.562803
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
27	Barbados	BRB	1.792250	4.413080	1062.840088	4.828680	48.878181	28.00	0.283338
28	Brunei Darussalam	BRN	1.884000	15.719223	1795.924160	2.335194	48.523699	23.75	0.411581
29	Bhutan	BTN	2.061250	1.975145	277.526670	2.706908	49.572296	161.75	0.775905
...	...	...	...	...	...	...	...	...	...
186	Turks and Caicos Islands	TCA	NaN	NaN	NaN	NaN	48.846884	NaN	0.033703
187	Chad	TCD	6.211000	11.945942	69.904561	1.725194	42.929245	892.25	13.574024
188	Togo	TGO	4.620000	4.183610	71.263825	2.037809	48.270471	380.75	7.230904
189	Thailand	THA	1.516750	406.136904	581.927487	3.183842	48.213034	21.00	68.384986
190	Tajikistan	TJK	3.495750	8.036228	169.745970	1.976367	48.260680	33.25	8.363844
191	Turkmenistan	TKM	2.313750	37.973096	288.572644	1.349303	48.906879	43.50	5.465637
192	Timor-Leste	TLS	5.797750	1.361430	98.577296	1.140440	48.337367	240.25	1.212718
193	Tonga	TON	3.745750	0.439179	250.962504	3.987285	47.697931	129.25	0.105909
194	Trinidad and Tobago	TTO	1.782750	24.570947	1778.148073	3.071370	NaN	63.25	1.353877
195	Tunisia	TUN	2.140000	44.824374	782.950522	4.118771	48.142132	63.25	11.144409
196	Turkey	TUR	2.078000	895.175577	997.374772	4.189521	48.789477	17.50	77.034345
197	Tuvalu	TUV	NaN	0.036470	563.500592	15.506929	47.472414	NaN	0.010910
198	Tanzania	TZA	5.181250	44.935542	131.704162	2.648609	50.666580	429.50	52.281324
199	Uganda	UGA	5.822500	25.941461	132.892684	2.014349	50.099485	366.50	38.865339
200	Ukraine	UKR	1.510250	135.379275	628.579254	3.960185	48.984198	24.25	45.302704
201	Uruguay	URY	2.027000	54.345132	1721.507752	6.044403	48.295555	15.50	3.419977
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
203	Uzbekistan	UZB	2.372750	61.340649	334.476754	3.118842	48.387434	37.00	30.784500
204	St. Vincent and the Grenadines	VCT	1.986000	0.730107	775.803386	4.365757	48.536415	45.75	0.109421
205	Venezuela, RB	VEN	2.378250	376.146268	896.815314	1.587088	48.400934	97.00	30.734524
206	British Virgin Islands	VGB	NaN	NaN	NaN	NaN	47.581520	NaN	0.029585
207	Virgin Islands (U.S.)	VIR	1.760000	3.812000	NaN	NaN	NaN	NaN	0.104141
208	Vietnam	VNM	1.959500	181.820736	368.374550	3.779501	48.021053	54.75	90.742400
209	Vanuatu	VUT	3.364750	0.782876	125.568712	3.689874	47.301617	82.50	0.258896
210	Samoa	WSM	4.118250	0.799887	366.353096	5.697059	48.350049	54.75	0.192225
211	Kosovo	XKX	2.142500	6.804620	NaN	NaN	NaN	NaN	1.813820
212	Yemen, Rep.	YEM	4.225750	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.375250	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.394250	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.943000	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

216 rows × 9 columns

Notice the ... in the center of this listing, to show that it has not printed some rows.

Sometimes we do not want to see all these rows, but only - say - the top five rows. The head method of the data frame is a useful way to do this:

gender_data.head()

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.66325	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.76925	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547

The Series also has a head method, that does the same thing:

gdp.head()

         NaN
   19.961015
  111.936542
   12.327586
    3.197538
Name: gdp_us_billion, dtype: float64

Selecting rows

We often want to select rows from the data frame that match some criterion.

Say we want to select the rows corresponding the countries with a high GDP.

Looking at the histogram of gdp above, we could try this as a threshold to identify high GDP countries.

high_gdp = gdp > 1000
high_gdp.head()

  False
  False
  False
  False
  False
Name: gdp_us_billion, dtype: bool

type(high_gdp)

pandas.core.series.Series

Notice that high_gdp is a Boolean series, like the Boolean arrays you have already seen. It has True for elements corresponding to countries with gdp value greater than 1000 and False otherwise.

We can use this Boolean series to select rows from the data frame.

Remember indexing. When we follow a name of something, like an array or series or data frame, with an open square bracket, this means we are trying to get data from the array or Series. The stuff inside the square brackets says what we want.

When we put our Boolean series inside the square brackets, it works like this:

rich_gender_data = gender_data[high_gdp]
rich_gender_data

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
35	China	CHN	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
58	Spain	ESP	1.307500	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
88	India	IND	2.449250	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
104	Korea, Rep.	KOR	1.232000	1346.751162	2385.447251	3.915606	48.023388	12.00	50.727212
124	Mexico	MEX	2.257000	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450
164	Russian Federation	RUS	1.724500	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175

type(rich_gender_data)

pandas.core.frame.DataFrame

rich_gender_data is a new data frame, that is a subset of the original gender_data frame. It contains only the rows where the GDP value is greater than 1000 billion dollars. Check the display of rich_gender_data above to confirm that the values in the gdp_us_billion column are all greater than 1000.

We can do a scatter plot of GDP values against maternal mortality rate, and we find, oddly, that for rich countries, there is little relationship between GDP and maternal mortality.

rich_gender_data.plot.scatter('gdp_us_billion', 'mat_mort_ratio')

<matplotlib.axes._subplots.AxesSubplot at 0x1150beb90>

png

One thing that stands out is the very high value at around 180. Which country does this refer to? We can use sorting to find out.

Sorting data frames

Data frames have a method sort_value. This returns a new data frame with the rows sorted by the values in the column we specify.

Here are the first five rows of the data frame of the rich countries:

rich_gender_data.head()

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
10	Australia	AUS	1.86150	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
26	Brazil	BRA	1.79525	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
32	Canada	CAN	1.60030	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
35	China	CHN	1.55875	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
49	Germany	DEU	1.45000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645

We are interested to find which of these richer countries has a high maternal mortality ratio. To do this, we can make a new data frame where the rows are sorted by the values in the mat_mort_ratio column:

rich_by_mmr = rich_gender_data.sort_values('mat_mort_ratio')
rich_by_mmr.head()

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
94	Italy	ITA	1.3900	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
58	Spain	ESP	1.3075	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
97	Japan	JPN	1.4300	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
10	Australia	AUS	1.8615	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
49	Germany	DEU	1.4500	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645

Notice that the rows are in ascending order of mat_mort_ratio. To find the countries with high maternal mortality, we might prefer to sort in descending order. As usual you can explore how you might do this by looking at the help for the sort_values method with:

rich_by_mmr.sort_values?

in a new cell. If you do that, you discover the ascending argument, that you can use like this:

rich_by_descending_mmr = rich_gender_data.sort_values('mat_mort_ratio', ascending=False)
rich_by_descending_mmr.head()

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
88	India	IND	2.44925	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
26	Brazil	BRA	1.79525	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
124	Mexico	MEX	2.25700	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450
35	China	CHN	1.55875	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
164	Russian Federation	RUS	1.72450	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504

As you might have guessed by now, Series also have a sort_values method. For a Series, you don’t have to specify the column to sort from, because you are using the Series values.

rich_mmr = rich_gender_data['mat_mort_ratio']
type(rich_mmr)

pandas.core.series.Series

rich_mmr.sort_values(ascending=False)

   185.25
    49.50
   40.00
    28.75
   25.25
   14.00
   12.00
     9.25
     8.75
     7.25
     6.25
     6.00
     5.75
     5.00
     4.00
Name: mat_mort_ratio, dtype: float64

Calculation on data frames

We can calculate with Pandas Series, just as we can with arrays.

For example, now we know that India has both a high GDP, and a high maternal mortality ratio, we wonder whether this is because India also has a large population, and therefore, relatively little money per person to spend on health care.

So, we would like know the GDP per capita. Luckily the data frame as a column “population”:

rich_population = rich_by_descending_mmr["population"]
rich_population

   1293.742537
    204.159544
   124.203450
   1364.446000
   143.793504
   318.558175
    50.727212
     64.641557
     66.302099
     35.517119
     81.281645
     23.444560
    127.297102
     46.553128
     60.378795
Name: population, dtype: float64

We can divide the GDP by the population in millions to get US billion dollars per million population.

This works exactly as it does for arrays:

rich_gdp = rich_by_descending_mmr["gdp_us_billion"]
rich_gdp

    2019.005411
    2198.765606
   1188.802780
   10182.790479
   1822.691700
  17369.124600
   1346.751162
    2768.864417
    2647.649725
    1708.473627
    3601.226158
    1422.994116
    5106.024760
    1299.724261
    2005.983980
Name: gdp_us_billion, dtype: float64

gdp_per_million = rich_gdp / rich_population
gdp_per_million

    1.560593
   10.769840
   9.571415
    7.462949
  12.675758
  54.524184
  26.548890
   42.834123
   39.933121
   48.102821
   44.305528
   60.696133
   40.111084
   27.919161
   33.223319
dtype: float64

Notice that the result is elementwise division, that is Python divides each element in rich_gdp by the corresponding element in rich_population.

Remember that India is the first country in the rich_by_descending_mmr data frame. It also has by far the lowest GDP per million population of any of this selection of rich countries. Here’s a plot of gdp_per_million against the corresponding values in mat_mort_ratio:

plt.scatter(gdp_per_million, rich_by_descending_mmr['mat_mort_ratio'])

<matplotlib.collections.PathCollection at 0x114251dd0>

png

It does look as if low income per person predisposes to high maternal mortality.