4.1 Introduction to data frames

Download notebook Interact

Introduction to data frames

Start by loading the usual plotting libraries.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')

Pandas is a Python package that implements data frames, and functions that operate on data frames.

import pandas as pd

Data frames and series

We start by loading data from a Comma Separated Value file (CSV file). If you are running on your laptop, you should download the gender_stats.csv file to the same directory as this notebook.

# Load the data file
gender_data = pd.read_csv('gender_stats.csv')

This is our usual assignment statement. The LHS is gender_data, the variable name. The RHS is an expression, that returns a value.

What type of value does it return?

type(gender_data)

pandas.core.frame.DataFrame

Pandas integrates with the Notebook, so, if you display a data frame in the notebook, it does a nice display of rows and columns.

gender_data

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Afghanistan	4.954500	1.996102e+10	161.138034	2.834598	40.109708	444.00	3.271584e+07
1	Albania	1.769250	1.232759e+10	574.202694	2.836021	47.201082	29.25	2.888280e+06
2	Algeria	2.866000	1.907346e+11	870.766508	4.984252	47.675617	142.50	3.909906e+07
3	American Samoa	NaN	6.405000e+08	NaN	NaN	NaN	NaN	5.542200e+04
4	Andorra	NaN	3.197538e+09	4421.224933	7.260281	47.123345	NaN	7.954740e+04
5	Angola	6.123000	1.119365e+11	254.747970	2.447546	NaN	501.25	2.693754e+07
6	Antigua and Barbuda	2.082000	1.298213e+09	1152.493656	3.676514	48.291463	NaN	9.887240e+04
7	Arab World	3.397587	2.709059e+12	761.401727	2.873840	47.119776	161.00	3.899620e+08
8	Argentina	2.328000	5.509810e+11	1148.256142	2.782216	48.915810	53.75	4.297667e+07
9	Armenia	1.545500	1.088536e+10	348.663884	1.916016	46.782180	27.25	2.904683e+06
10	Aruba	1.663250	NaN	NaN	NaN	48.721939	NaN	1.037444e+05
11	Australia	1.861500	1.422994e+12	4256.058988	6.292381	48.576707	6.00	2.344456e+07
12	Austria	1.455000	4.074943e+11	4930.298893	8.504276	48.556078	4.00	8.566294e+06
13	Azerbaijan	1.980000	6.200300e+10	956.709718	1.197249	46.157363	25.25	9.531856e+06
14	Bahamas, The	1.877250	8.688000e+09	1727.128385	3.308626	NaN	81.50	3.819036e+05
15	Bahrain	2.065250	3.200401e+10	2030.158316	2.976386	49.116838	15.25	1.349810e+06
16	Bangladesh	2.193250	1.745451e+11	85.968844	0.860447	50.460564	194.75	1.593712e+08
17	Barbados	1.792250	4.413080e+09	1062.840088	4.828680	48.878181	28.00	2.833384e+05
18	Belarus	1.677000	6.478294e+10	986.236757	3.876601	48.685741	4.00	9.480348e+06
19	Belgium	1.755000	4.942218e+11	4297.838005	8.221003	48.864675	7.00	1.122850e+07
20	Belize	2.594750	1.680325e+09	471.967465	3.744844	48.317238	29.25	3.517636e+05
21	Benin	4.806750	8.778151e+09	83.726190	2.206916	47.211127	417.50	1.029371e+07
22	Bermuda	1.617500	5.555624e+09	NaN	NaN	48.423588	NaN	6.510080e+04
23	Bhutan	2.061250	1.975145e+09	277.526670	2.706908	49.572296	161.75	7.759054e+05
24	Bolivia	2.995250	3.150932e+10	381.007594	4.192031	48.464175	218.25	1.056280e+07
25	Bosnia and Herzegovina	1.267000	1.732333e+10	941.504655	6.841021	48.634905	11.75	3.574396e+06
26	Botswana	2.845000	1.511339e+10	880.909202	3.552071	48.844009	138.75	2.169170e+06
27	Brazil	1.795250	2.198766e+12	1303.199104	3.773473	47.784577	49.50	2.041595e+08
28	British Virgin Islands	NaN	NaN	NaN	NaN	47.581520	NaN	2.958540e+04
29	Brunei Darussalam	1.884000	1.571922e+10	1795.924160	2.335194	48.523699	23.75	4.115812e+05
...	...	...	...	...	...	...	...	...
233	Syrian Arab Republic	2.967750	NaN	269.945739	1.507166	48.047394	62.00	1.931967e+07
234	Tajikistan	3.495750	8.036228e+09	169.745970	1.976367	48.260680	33.25	8.363844e+06
235	Tanzania	5.181250	4.493554e+10	131.704162	2.648609	50.666580	429.50	5.228132e+07
236	Thailand	1.516750	4.061369e+11	581.927487	3.183842	48.213034	21.00	6.838499e+07
237	Timor-Leste	5.797750	1.361430e+09	98.577296	1.140440	48.337367	240.25	1.212718e+06
238	Togo	4.620000	4.183610e+09	71.263825	2.037809	48.270471	380.75	7.230904e+06
239	Tonga	3.745750	4.391789e+08	250.962504	3.987285	47.697931	129.25	1.059094e+05
240	Trinidad and Tobago	1.782750	2.457095e+10	1778.148073	3.071370	NaN	63.25	1.353877e+06
241	Tunisia	2.140000	4.482437e+10	782.950522	4.118771	48.142132	63.25	1.114441e+07
242	Turkey	2.078000	8.951756e+11	997.374772	4.189521	48.789477	17.50	7.703435e+07
243	Turkmenistan	2.313750	3.797310e+10	288.572644	1.349303	48.906879	43.50	5.465637e+06
244	Turks and Caicos Islands	NaN	NaN	NaN	NaN	48.846884	NaN	3.370340e+04
245	Tuvalu	NaN	3.646999e+07	563.500592	15.506929	47.472414	NaN	1.091000e+04
246	Uganda	5.822500	2.594146e+10	132.892684	2.014349	50.099485	366.50	3.886534e+07
247	Ukraine	1.510250	1.353793e+11	628.579254	3.960185	48.984198	24.25	4.530270e+07
248	United Arab Emirates	1.793000	3.750271e+11	2202.407569	2.581168	48.789260	6.00	9.080299e+06
249	United Kingdom	1.842500	2.768864e+12	3357.983675	7.720655	48.791809	9.25	6.464156e+07
250	United States	1.860875	1.736912e+13	9060.068657	8.121961	48.758830	14.00	3.185582e+08
251	Upper middle income	1.795244	2.097441e+13	870.897512	3.358153	47.112001	43.25	2.540966e+09
252	Uruguay	2.027000	5.434513e+10	1721.507752	6.044403	48.295555	15.50	3.419977e+06
253	Uzbekistan	2.372750	6.134065e+10	334.476754	3.118842	48.387434	37.00	3.078450e+07
254	Vanuatu	3.364750	7.828760e+08	125.568712	3.689874	47.301617	82.50	2.588964e+05
255	Venezuela, RB	2.378250	3.761463e+11	896.815314	1.587088	48.400934	97.00	3.073452e+07
256	Vietnam	1.959500	1.818207e+11	368.374550	3.779501	48.021053	54.75	9.074240e+07
257	Virgin Islands (U.S.)	1.760000	3.812000e+09	NaN	NaN	NaN	NaN	1.041414e+05
258	West Bank and Gaza	4.208000	1.250822e+10	NaN	NaN	48.828520	47.50	4.296960e+06
259	World	2.464282	7.613006e+13	1223.941243	5.947058	48.076575	223.75	7.269321e+09
260	Yemen, Rep.	4.225750	3.681934e+10	207.949700	1.417836	44.470076	399.75	2.624661e+07
261	Zambia	5.394250	2.428099e+10	185.556359	2.687290	49.934484	233.75	1.563322e+07
262	Zimbabwe	3.943000	1.549551e+10	115.519881	2.695188	49.529875	398.00	1.542096e+07

263 rows × 8 columns

The data frame has rows and columns. Like other Python objects, it has attributes. These are pieces of data associated with the data frame. You have already seen methods, which are functions associated with the data frame. You can access attributes in the same way as you access methods, by typing the variable name, followed by a dot ., followed by the attribute name.

For example, one attribute of the data frame, is the shape:

gender_data.shape

(263, 8)

Another attribute is columns. This has the column names. For example, here is a good way of quickly seeing the column names for a data frame:

gender_data.columns

Index(['country', 'fert_rate', 'gdp', 'health_exp_per_cap', 'health_exp_pub',
       'prim_ed_girls', 'mat_mort_ratio', 'population'],
      dtype='object')

You need more information about what these column names refer to. Here are the longer descriptions from the original data source (link above):

fert_rate: Fertility rate, total (births per woman).
gdp: GDP (current US$).
health_exp_per_cap: Health expenditure per capita, PPP (constant 2011 international \$).
health_exp_pub: Health expenditure, public (% of GDP).
prim_ed_girls: Primary education, pupils (% female).
mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).
population: Population, total.

You have just seen array slicing (in Selecting with arrays. You remember that array slicing uses square brackets. Data frames also allow slicing. For example, we often want to get all the data for a single column of the data frame. To do this, we use the same square bracket notation as we use for array slicing, with the name of the column inside the square brackets.

gdp = gender_data['gdp']

What type of thing is this column of data?

type(gdp)

pandas.core.series.Series

Here are the values for gdp. You will notice that these are the same values you saw in the “gdp” column when you displayed the whole data frame.

gdp

    1.996102e+10
    1.232759e+10
    1.907346e+11
    6.405000e+08
    3.197538e+09
    1.119365e+11
    1.298213e+09
    2.709059e+12
    5.509810e+11
    1.088536e+10
            NaN
   1.422994e+12
   4.074943e+11
   6.200300e+10
   8.688000e+09
   3.200401e+10
   1.745451e+11
   4.413080e+09
   6.478294e+10
   4.942218e+11
   1.680325e+09
   8.778151e+09
   5.555624e+09
   1.975145e+09
   3.150932e+10
   1.732333e+10
   1.511339e+10
   2.198766e+12
            NaN
   1.571922e+10
           ...     
           NaN
  8.036228e+09
  4.493554e+10
  4.061369e+11
  1.361430e+09
  4.183610e+09
  4.391789e+08
  2.457095e+10
  4.482437e+10
  8.951756e+11
  3.797310e+10
           NaN
  3.646999e+07
  2.594146e+10
  1.353793e+11
  3.750271e+11
  2.768864e+12
  1.736912e+13
  2.097441e+13
  5.434513e+10
  6.134065e+10
  7.828760e+08
  3.761463e+11
  1.818207e+11
  3.812000e+09
  1.250822e+10
  7.613006e+13
  3.681934e+10
  2.428099e+10
  1.549551e+10
Name: gdp, Length: 263, dtype: float64

What are these values like 6.405000e+08? These are numbers, in scientific notation. Scientific notation is a compact way of writing very large or very small numbers. The value after e above is the exponent, in this case 08. The number above means $6.405

10^8$. For example, here is $2 * 10^7$:

2e7

20000000.0

Missing values and `NaN`

Looking at the values of gdp (and therefore, the values of the gdp column of gender_data, we see that some of the values are NaN, which means Not a Number. Pandas uses this marker to indicate values that are not available, or missing data.

Numpy does not like to calculate with NaN values. Here is Numpy trying to calculate the median of the gdp values.

np.median(gdp)

nan

Notice the warning about an invalid value.

Numpy recognizes that one or more values are NaN and refuses to guess what to do, when calculating the median.

You saw from the shape above that gender_data has 263 rows. We can use the general Python len function, to see how many elements there are in gdp.

len(gdp)

As expected, it has the same number of elements as there are rows in gender_data.

The count method of the series gives the number of values that are not missing - that is - not NaN.

gdp.count()

Plotting with methods

The gdp variable is a sequence of values, so we can do a histogram on these values, as we have done histograms on arrays.

plt.hist(gdp);

png

Notice the multiple warnings as Matplotlib tried to calculate the bin widths for the histogram. These are from the NaN values.

Another way to do the histogram, is to use the hist method of the series.

A method is a function attached to a value. In this case hist is a function attached to a value of type Series.

Using the hist method instead of the plt.hist function can make the code a bit easier to read. The method also has the advantage that it discards the NaN values, by default, so it does not generate the same warnings.

gdp.hist();

png

Now we have had a look at the GDP values, we will look at the values for the mat_mort_ratio column. These are the numbers of women who die in childbirth for every 100,000 births.

mmr = gender_data['mat_mort_ratio']
mmr

    444.00
     29.25
    142.50
       NaN
       NaN
    501.25
       NaN
    161.00
     53.75
     27.25
      NaN
     6.00
     4.00
    25.25
    81.50
    15.25
   194.75
    28.00
     4.00
     7.00
    29.25
   417.50
      NaN
   161.75
   218.25
    11.75
   138.75
    49.50
      NaN
    23.75
        ...  
   62.00
   33.25
  429.50
   21.00
  240.25
  380.75
  129.25
   63.25
   63.25
   17.50
   43.50
     NaN
     NaN
  366.50
   24.25
    6.00
    9.25
   14.00
   43.25
   15.50
   37.00
   82.50
   97.00
   54.75
     NaN
   47.50
  223.75
  399.75
  233.75
  398.00
Name: mat_mort_ratio, Length: 263, dtype: float64

mmr.hist();

png

We are interested in the relationship of gpp and mmr. Maybe richer countries have better health care, and fewer maternal deaths.

Here is a plot, using the standard Matplotlib scatter function.

plt.scatter(gdp, mmr);

png

We can do the same plot using the plot.scatter method on the data frame. In that case, we specify the column names that should go on the x and the y axes.

gender_data.plot.scatter('gdp', 'mat_mort_ratio');

png

An advantage of doing it this way is that we get the column names on the x and y axes by default.

Showing the top 5 values with the `head` method

We have already seen that Pandas will display the data frame with nice formatting. If the data frame is long, it will display only the first few and the last few rows:

gender_data

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Afghanistan	4.954500	1.996102e+10	161.138034	2.834598	40.109708	444.00	3.271584e+07
1	Albania	1.769250	1.232759e+10	574.202694	2.836021	47.201082	29.25	2.888280e+06
2	Algeria	2.866000	1.907346e+11	870.766508	4.984252	47.675617	142.50	3.909906e+07
3	American Samoa	NaN	6.405000e+08	NaN	NaN	NaN	NaN	5.542200e+04
4	Andorra	NaN	3.197538e+09	4421.224933	7.260281	47.123345	NaN	7.954740e+04
5	Angola	6.123000	1.119365e+11	254.747970	2.447546	NaN	501.25	2.693754e+07
6	Antigua and Barbuda	2.082000	1.298213e+09	1152.493656	3.676514	48.291463	NaN	9.887240e+04
7	Arab World	3.397587	2.709059e+12	761.401727	2.873840	47.119776	161.00	3.899620e+08
8	Argentina	2.328000	5.509810e+11	1148.256142	2.782216	48.915810	53.75	4.297667e+07
9	Armenia	1.545500	1.088536e+10	348.663884	1.916016	46.782180	27.25	2.904683e+06
10	Aruba	1.663250	NaN	NaN	NaN	48.721939	NaN	1.037444e+05
11	Australia	1.861500	1.422994e+12	4256.058988	6.292381	48.576707	6.00	2.344456e+07
12	Austria	1.455000	4.074943e+11	4930.298893	8.504276	48.556078	4.00	8.566294e+06
13	Azerbaijan	1.980000	6.200300e+10	956.709718	1.197249	46.157363	25.25	9.531856e+06
14	Bahamas, The	1.877250	8.688000e+09	1727.128385	3.308626	NaN	81.50	3.819036e+05
15	Bahrain	2.065250	3.200401e+10	2030.158316	2.976386	49.116838	15.25	1.349810e+06
16	Bangladesh	2.193250	1.745451e+11	85.968844	0.860447	50.460564	194.75	1.593712e+08
17	Barbados	1.792250	4.413080e+09	1062.840088	4.828680	48.878181	28.00	2.833384e+05
18	Belarus	1.677000	6.478294e+10	986.236757	3.876601	48.685741	4.00	9.480348e+06
19	Belgium	1.755000	4.942218e+11	4297.838005	8.221003	48.864675	7.00	1.122850e+07
20	Belize	2.594750	1.680325e+09	471.967465	3.744844	48.317238	29.25	3.517636e+05
21	Benin	4.806750	8.778151e+09	83.726190	2.206916	47.211127	417.50	1.029371e+07
22	Bermuda	1.617500	5.555624e+09	NaN	NaN	48.423588	NaN	6.510080e+04
23	Bhutan	2.061250	1.975145e+09	277.526670	2.706908	49.572296	161.75	7.759054e+05
24	Bolivia	2.995250	3.150932e+10	381.007594	4.192031	48.464175	218.25	1.056280e+07
25	Bosnia and Herzegovina	1.267000	1.732333e+10	941.504655	6.841021	48.634905	11.75	3.574396e+06
26	Botswana	2.845000	1.511339e+10	880.909202	3.552071	48.844009	138.75	2.169170e+06
27	Brazil	1.795250	2.198766e+12	1303.199104	3.773473	47.784577	49.50	2.041595e+08
28	British Virgin Islands	NaN	NaN	NaN	NaN	47.581520	NaN	2.958540e+04
29	Brunei Darussalam	1.884000	1.571922e+10	1795.924160	2.335194	48.523699	23.75	4.115812e+05
...	...	...	...	...	...	...	...	...
233	Syrian Arab Republic	2.967750	NaN	269.945739	1.507166	48.047394	62.00	1.931967e+07
234	Tajikistan	3.495750	8.036228e+09	169.745970	1.976367	48.260680	33.25	8.363844e+06
235	Tanzania	5.181250	4.493554e+10	131.704162	2.648609	50.666580	429.50	5.228132e+07
236	Thailand	1.516750	4.061369e+11	581.927487	3.183842	48.213034	21.00	6.838499e+07
237	Timor-Leste	5.797750	1.361430e+09	98.577296	1.140440	48.337367	240.25	1.212718e+06
238	Togo	4.620000	4.183610e+09	71.263825	2.037809	48.270471	380.75	7.230904e+06
239	Tonga	3.745750	4.391789e+08	250.962504	3.987285	47.697931	129.25	1.059094e+05
240	Trinidad and Tobago	1.782750	2.457095e+10	1778.148073	3.071370	NaN	63.25	1.353877e+06
241	Tunisia	2.140000	4.482437e+10	782.950522	4.118771	48.142132	63.25	1.114441e+07
242	Turkey	2.078000	8.951756e+11	997.374772	4.189521	48.789477	17.50	7.703435e+07
243	Turkmenistan	2.313750	3.797310e+10	288.572644	1.349303	48.906879	43.50	5.465637e+06
244	Turks and Caicos Islands	NaN	NaN	NaN	NaN	48.846884	NaN	3.370340e+04
245	Tuvalu	NaN	3.646999e+07	563.500592	15.506929	47.472414	NaN	1.091000e+04
246	Uganda	5.822500	2.594146e+10	132.892684	2.014349	50.099485	366.50	3.886534e+07
247	Ukraine	1.510250	1.353793e+11	628.579254	3.960185	48.984198	24.25	4.530270e+07
248	United Arab Emirates	1.793000	3.750271e+11	2202.407569	2.581168	48.789260	6.00	9.080299e+06
249	United Kingdom	1.842500	2.768864e+12	3357.983675	7.720655	48.791809	9.25	6.464156e+07
250	United States	1.860875	1.736912e+13	9060.068657	8.121961	48.758830	14.00	3.185582e+08
251	Upper middle income	1.795244	2.097441e+13	870.897512	3.358153	47.112001	43.25	2.540966e+09
252	Uruguay	2.027000	5.434513e+10	1721.507752	6.044403	48.295555	15.50	3.419977e+06
253	Uzbekistan	2.372750	6.134065e+10	334.476754	3.118842	48.387434	37.00	3.078450e+07
254	Vanuatu	3.364750	7.828760e+08	125.568712	3.689874	47.301617	82.50	2.588964e+05
255	Venezuela, RB	2.378250	3.761463e+11	896.815314	1.587088	48.400934	97.00	3.073452e+07
256	Vietnam	1.959500	1.818207e+11	368.374550	3.779501	48.021053	54.75	9.074240e+07
257	Virgin Islands (U.S.)	1.760000	3.812000e+09	NaN	NaN	NaN	NaN	1.041414e+05
258	West Bank and Gaza	4.208000	1.250822e+10	NaN	NaN	48.828520	47.50	4.296960e+06
259	World	2.464282	7.613006e+13	1223.941243	5.947058	48.076575	223.75	7.269321e+09
260	Yemen, Rep.	4.225750	3.681934e+10	207.949700	1.417836	44.470076	399.75	2.624661e+07
261	Zambia	5.394250	2.428099e+10	185.556359	2.687290	49.934484	233.75	1.563322e+07
262	Zimbabwe	3.943000	1.549551e+10	115.519881	2.695188	49.529875	398.00	1.542096e+07

263 rows × 8 columns

Notice the ... in the center of this listing, to show that it has not printed some rows.

Sometimes we do not want to see all these rows, but only - say - the top five rows. The head method of the data frame is a useful way to do this:

gender_data.head()

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Afghanistan	4.95450	1.996102e+10	161.138034	2.834598	40.109708	444.00	32715838.4
1	Albania	1.76925	1.232759e+10	574.202694	2.836021	47.201082	29.25	2888280.2
2	Algeria	2.86600	1.907346e+11	870.766508	4.984252	47.675617	142.50	39099060.4
3	American Samoa	NaN	6.405000e+08	NaN	NaN	NaN	NaN	55422.0
4	Andorra	NaN	3.197538e+09	4421.224933	7.260281	47.123345	NaN	79547.4

The Series also has a head method, that does the same thing:

gdp.head()

  1.996102e+10
  1.232759e+10
  1.907346e+11
  6.405000e+08
  3.197538e+09
Name: gdp, dtype: float64

Selecting rows

We often want to select rows from the data frame that match some criterion.

Say we want to select the rows corresponding the countries with a high GDP.

Looking at the histogram of gdp above, we could try this as a threshold to identify high GDP countries.

high_gdp = gdp > 1e13
high_gdp

    False
    False
    False
    False
    False
    False
    False
    False
    False
    False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
       ...  
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
   True
   True
  False
  False
  False
  False
  False
  False
  False
   True
  False
  False
  False
Name: gdp, Length: 263, dtype: bool

type(high_gdp)

pandas.core.series.Series

Notice that high_gdp is a Boolean series, like the Boolean arrays you have already seen. It has True for elements corresponding to countries with gdp value greater than 1e13 and False otherwise.

We can use this Boolean series to select rows from the data frame. The loc attribute of the data frame allows us to LOCate values in the data frame. For our Boolean series, it works like this:

rich_gender_data = gender_data.loc[high_gdp]
rich_gender_data

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
44	China	1.558750	1.018279e+13	657.748859	3.015530	46.297964	28.75	1.364446e+09
60	Early-demographic dividend	2.636376	1.019283e+13	392.428268	2.595967	48.651143	169.00	3.083697e+09
61	East Asia & Pacific	1.781424	2.168128e+13	835.974259	4.687596	47.212490	63.00	2.265974e+09
62	East Asia & Pacific (IDA & IBRD)	1.811850	1.239991e+13	558.711100	2.815573	47.098031	66.75	1.996942e+09
63	East Asia & Pacific (excluding high income)	1.813950	1.242383e+13	558.702327	2.815498	47.115173	66.75	2.022090e+09
71	Euro area	1.551004	1.255692e+13	3913.466364	7.956080	48.610030	6.50	3.384615e+08
72	Europe & Central Asia	1.738094	2.191519e+13	2518.566323	7.130694	48.653599	16.75	9.032073e+08
75	European Union	1.570012	1.731910e+13	3448.910224	7.816628	48.658777	8.00	5.082110e+08
98	High income	1.686585	4.884635e+13	5045.885008	7.602022	48.701030	10.00	1.175934e+09
102	IBRD only	2.103185	2.607726e+13	625.357428	3.104682	48.026892	106.25	4.607548e+09
103	IDA & IBRD total	2.587557	2.803020e+13	507.044258	3.013687	47.896337	244.75	6.113147e+09
128	Late-demographic dividend	1.667820	1.862014e+13	838.259199	3.335746	47.046383	36.00	2.235352e+09
140	Low & middle income	2.595258	2.724634e+13	498.193061	2.982075	NaN	245.25	6.089148e+09
160	Middle income	2.361746	2.691014e+13	542.340940	2.996480	48.044140	185.75	5.468296e+09
177	North America	1.834404	1.908336e+13	8615.535450	8.066182	48.708683	13.50	3.541404e+08
180	OECD members	1.749418	4.787743e+13	4566.959377	7.636198	48.704364	15.00	1.273149e+09
193	Post-demographic dividend	1.636470	4.541806e+13	5124.214162	7.858170	48.649863	10.75	1.092637e+09
250	United States	1.860875	1.736912e+13	9060.068657	8.121961	48.758830	14.00	3.185582e+08
251	Upper middle income	1.795244	2.097441e+13	870.897512	3.358153	47.112001	43.25	2.540966e+09
259	World	2.464282	7.613006e+13	1223.941243	5.947058	48.076575	223.75	7.269321e+09

type(rich_gender_data)

pandas.core.frame.DataFrame

rich_gender_data is a new data frame, that is a subset of the original gender_data frame. It contains only the rows where the GDP value is greater than 1e13 dollars. Check the display of rich_gender_data above to confirm that the values in the gdp column are all greater than 1e13.

We can do a scatter plot of GDP values against maternal mortality rate, and we find, oddly, that for rich countries, there is little relationship between GDP and maternal mortality.

rich_gender_data.plot.scatter('gdp', 'mat_mort_ratio')

<matplotlib.axes._subplots.AxesSubplot at 0x1178bb7b8>

png

Sorting data frames

Data frames have a method sort_value. This returns a new data frame with the rows sorted by the values in the column we specify.

Here are the first five rows of the original data frame:

gender_data.head()

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Afghanistan	4.95450	1.996102e+10	161.138034	2.834598	40.109708	444.00	32715838.4
1	Albania	1.76925	1.232759e+10	574.202694	2.836021	47.201082	29.25	2888280.2
2	Algeria	2.86600	1.907346e+11	870.766508	4.984252	47.675617	142.50	39099060.4
3	American Samoa	NaN	6.405000e+08	NaN	NaN	NaN	NaN	55422.0
4	Andorra	NaN	3.197538e+09	4421.224933	7.260281	47.123345	NaN	79547.4

We can make a new data frame where the rows are sorted by the values in the gdp column:

gender_data_by_gdp = gender_data.sort_values('gdp')
gender_data_by_gdp.head()

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
245	Tuvalu	NaN	3.646999e+07	563.500592	15.506929	47.472414	NaN	10910.0
169	Nauru	NaN	1.063908e+08	611.020856	4.527932	49.409439	NaN	11695.4
121	Kiribati	3.74525	1.774306e+08	179.372383	8.279184	49.557255	95.0	110481.6
152	Marshall Islands	NaN	1.843189e+08	657.059335	14.302529	48.397282	NaN	52882.8
185	Palau	2.22000	2.548400e+08	1389.993077	6.623460	46.425690	NaN	21112.6

Notice that the rows are in ascending order of gdp. You can imagine, that we often want descending order. As usual you can explore how you might do this by looking at the help for the sort_values method with:

gender_data.sort_values?

in a new cell. If you do that, you discover the ascending argument, that you can use like this:

gender_data.sort_values('gdp', ascending=False)

	country	fert_rate	gdp	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
259	World	2.464282	7.613006e+13	1223.941243	5.947058	48.076575	223.75	7.269321e+09
98	High income	1.686585	4.884635e+13	5045.885008	7.602022	48.701030	10.00	1.175934e+09
180	OECD members	1.749418	4.787743e+13	4566.959377	7.636198	48.704364	15.00	1.273149e+09
193	Post-demographic dividend	1.636470	4.541806e+13	5124.214162	7.858170	48.649863	10.75	1.092637e+09
103	IDA & IBRD total	2.587557	2.803020e+13	507.044258	3.013687	47.896337	244.75	6.113147e+09
140	Low & middle income	2.595258	2.724634e+13	498.193061	2.982075	NaN	245.25	6.089148e+09
160	Middle income	2.361746	2.691014e+13	542.340940	2.996480	48.044140	185.75	5.468296e+09
102	IBRD only	2.103185	2.607726e+13	625.357428	3.104682	48.026892	106.25	4.607548e+09
72	Europe & Central Asia	1.738094	2.191519e+13	2518.566323	7.130694	48.653599	16.75	9.032073e+08
61	East Asia & Pacific	1.781424	2.168128e+13	835.974259	4.687596	47.212490	63.00	2.265974e+09
251	Upper middle income	1.795244	2.097441e+13	870.897512	3.358153	47.112001	43.25	2.540966e+09
177	North America	1.834404	1.908336e+13	8615.535450	8.066182	48.708683	13.50	3.541404e+08
128	Late-demographic dividend	1.667820	1.862014e+13	838.259199	3.335746	47.046383	36.00	2.235352e+09
250	United States	1.860875	1.736912e+13	9060.068657	8.121961	48.758830	14.00	3.185582e+08
75	European Union	1.570012	1.731910e+13	3448.910224	7.816628	48.658777	8.00	5.082110e+08
71	Euro area	1.551004	1.255692e+13	3913.466364	7.956080	48.610030	6.50	3.384615e+08
63	East Asia & Pacific (excluding high income)	1.813950	1.242383e+13	558.702327	2.815498	47.115173	66.75	2.022090e+09
62	East Asia & Pacific (IDA & IBRD)	1.811850	1.239991e+13	558.711100	2.815573	47.098031	66.75	1.996942e+09
60	Early-demographic dividend	2.636376	1.019283e+13	392.428268	2.595967	48.651143	169.00	3.083697e+09
44	China	1.558750	1.018279e+13	657.748859	3.015530	46.297964	28.75	1.364446e+09
142	Lower middle income	2.877481	5.929275e+12	254.618642	1.657578	48.632937	262.50	2.927329e+09
129	Latin America & Caribbean	2.125164	5.845138e+12	1069.141865	3.659621	48.378604	70.75	6.242184e+08
130	Latin America & Caribbean (IDA & IBRD)	2.138122	5.636800e+12	1050.010897	3.585924	48.382980	71.50	6.080342e+08
131	Latin America & Caribbean (excluding high income)	2.140742	5.377437e+12	1046.116268	3.638974	48.379046	72.75	5.969302e+08
117	Japan	1.430000	5.106025e+12	3687.126279	8.496074	48.744199	5.75	1.272971e+08
73	Europe & Central Asia (IDA & IBRD)	1.861440	4.213635e+12	1173.135713	3.878399	48.632033	24.00	4.505588e+08
74	Europe & Central Asia (excluding high income)	1.909466	3.710324e+12	1138.169165	3.794867	48.619737	25.50	4.125489e+08
85	Germany	1.450000	3.601226e+12	4909.659884	8.542615	48.568695	6.25	8.128164e+07
157	Middle East & North Africa	2.852531	3.380647e+12	912.491552	3.055660	47.621273	83.50	4.208876e+08
249	United Kingdom	1.842500	2.768864e+12	3357.983675	7.720655	48.791809	9.25	6.464156e+07
...	...	...	...	...	...	...	...	...
254	Vanuatu	3.364750	7.828760e+08	125.568712	3.689874	47.301617	82.50	2.588964e+05
224	St. Vincent and the Grenadines	1.986000	7.301068e+08	775.803386	4.365757	48.536415	45.75	1.094206e+05
3	American Samoa	NaN	6.405000e+08	NaN	NaN	NaN	NaN	5.542200e+04
46	Comoros	4.524000	6.039190e+08	99.221938	2.289039	47.515774	349.50	7.595556e+05
58	Dominica	NaN	5.130350e+08	572.413734	3.749851	48.774225	NaN	7.278540e+04
239	Tonga	3.745750	4.391789e+08	250.962504	3.987285	47.697931	129.25	1.059094e+05
156	Micronesia, Fed. Sts.	3.269250	3.193208e+08	456.403851	12.037277	48.084030	103.25	1.041180e+05
202	Sao Tome and Principe	4.603750	3.145400e+08	304.237533	3.227125	48.664169	159.50	1.913326e+05
185	Palau	2.220000	2.548400e+08	1389.993077	6.623460	46.425690	NaN	2.111260e+04
152	Marshall Islands	NaN	1.843189e+08	657.059335	14.302529	48.397282	NaN	5.288280e+04
121	Kiribati	3.745250	1.774306e+08	179.372383	8.279184	49.557255	95.00	1.104816e+05
169	Nauru	NaN	1.063908e+08	611.020856	4.527932	49.409439	NaN	1.169540e+04
245	Tuvalu	NaN	3.646999e+07	563.500592	15.506929	47.472414	NaN	1.091000e+04
10	Aruba	1.663250	NaN	NaN	NaN	48.721939	NaN	1.037444e+05
28	British Virgin Islands	NaN	NaN	NaN	NaN	47.581520	NaN	2.958540e+04
38	Cayman Islands	NaN	NaN	NaN	NaN	49.866695	NaN	5.915880e+04
42	Channel Islands	1.462250	NaN	NaN	NaN	NaN	NaN	1.629612e+05
53	Curacao	2.050000	NaN	NaN	NaN	47.986740	NaN	1.559594e+05
68	Eritrea	4.324500	NaN	47.155226	1.429902	46.331369	524.50	NaN
81	French Polynesia	2.051000	NaN	NaN	NaN	NaN	NaN	2.757226e+05
87	Gibraltar	NaN	NaN	NaN	NaN	NaN	NaN	3.402560e+04
122	Korea, Dem. People's Rep.	1.982500	NaN	NaN	NaN	49.030281	86.25	2.511378e+07
137	Libya	2.485750	NaN	955.388861	3.231360	NaN	9.00	6.225309e+06
162	Monaco	NaN	NaN	6421.678715	3.720085	49.637143	NaN	3.813840e+04
172	New Caledonia	2.250000	NaN	NaN	NaN	NaN	NaN	2.682000e+05
201	San Marino	1.260000	NaN	3437.298747	5.752693	45.616261	NaN	3.260740e+04
209	Sint Maarten (Dutch part)	NaN	NaN	NaN	NaN	49.508551	NaN	3.755220e+04
223	St. Martin (French part)	1.812500	NaN	NaN	NaN	NaN	NaN	3.149120e+04
233	Syrian Arab Republic	2.967750	NaN	269.945739	1.507166	48.047394	62.00	1.931967e+07
244	Turks and Caicos Islands	NaN	NaN	NaN	NaN	48.846884	NaN	3.370340e+04

263 rows × 8 columns

As you might have guessed by now, Series also have a sort_values method. For a Series, you don’t have to specify the column to sort from, because you are using the Series values.

gdp.sort_values(ascending=False)

  7.613006e+13
   4.884635e+13
  4.787743e+13
  4.541806e+13
  2.803020e+13
  2.724634e+13
  2.691014e+13
  2.607726e+13
   2.191519e+13
   2.168128e+13
  2.097441e+13
  1.908336e+13
  1.862014e+13
  1.736912e+13
   1.731910e+13
   1.255692e+13
   1.242383e+13
   1.239991e+13
   1.019283e+13
   1.018279e+13
  5.929275e+12
  5.845138e+12
  5.636800e+12
  5.377437e+12
  5.106025e+12
   4.213635e+12
   3.710324e+12
   3.601226e+12
  3.380647e+12
  2.768864e+12
           ...     
  7.828760e+08
  7.301068e+08
    6.405000e+08
   6.039190e+08
   5.130350e+08
  4.391789e+08
  3.193208e+08
  3.145400e+08
  2.548400e+08
  1.843189e+08
  1.774306e+08
  1.063908e+08
  3.646999e+07
            NaN
            NaN
            NaN
            NaN
            NaN
            NaN
            NaN
            NaN
           NaN
           NaN
           NaN
           NaN
           NaN
           NaN
           NaN
           NaN
           NaN
Name: gdp, Length: 263, dtype: float64

Introduction to data frames

Data frames and series

Missing values and NaN

Plotting with methods

Showing the top 5 values with the head method

Selecting rows

Sorting data frames

Missing values and `NaN`

Showing the top 5 values with the `head` method