4.1 Introduction to data frames
Introduction to data frames
Start by loading the usual plotting libraries.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
Pandas is a Python package that implements data frames, and functions that operate on data frames.
import pandas as pd
Data frames and series
We start by loading data from a Comma Separated Value file (CSV file). If you are running on your laptop, you should download the gender_stats.csv file to the same directory as this notebook.
# Load the data file
gender_data = pd.read_csv('gender_stats.csv')
This is our usual assignment statement. The LHS is gender_data, the variable name.  The RHS is an expression, that returns a value.
What type of value does it return?
type(gender_data)
pandas.core.frame.DataFrame
Pandas integrates with the Notebook, so, if you display a data frame in the notebook, it does a nice display of rows and columns.
gender_data
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 4.954500 | 1.996102e+10 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 3.271584e+07 | 
| 1 | Albania | 1.769250 | 1.232759e+10 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280e+06 | 
| 2 | Algeria | 2.866000 | 1.907346e+11 | 870.766508 | 4.984252 | 47.675617 | 142.50 | 3.909906e+07 | 
| 3 | American Samoa | NaN | 6.405000e+08 | NaN | NaN | NaN | NaN | 5.542200e+04 | 
| 4 | Andorra | NaN | 3.197538e+09 | 4421.224933 | 7.260281 | 47.123345 | NaN | 7.954740e+04 | 
| 5 | Angola | 6.123000 | 1.119365e+11 | 254.747970 | 2.447546 | NaN | 501.25 | 2.693754e+07 | 
| 6 | Antigua and Barbuda | 2.082000 | 1.298213e+09 | 1152.493656 | 3.676514 | 48.291463 | NaN | 9.887240e+04 | 
| 7 | Arab World | 3.397587 | 2.709059e+12 | 761.401727 | 2.873840 | 47.119776 | 161.00 | 3.899620e+08 | 
| 8 | Argentina | 2.328000 | 5.509810e+11 | 1148.256142 | 2.782216 | 48.915810 | 53.75 | 4.297667e+07 | 
| 9 | Armenia | 1.545500 | 1.088536e+10 | 348.663884 | 1.916016 | 46.782180 | 27.25 | 2.904683e+06 | 
| 10 | Aruba | 1.663250 | NaN | NaN | NaN | 48.721939 | NaN | 1.037444e+05 | 
| 11 | Australia | 1.861500 | 1.422994e+12 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 2.344456e+07 | 
| 12 | Austria | 1.455000 | 4.074943e+11 | 4930.298893 | 8.504276 | 48.556078 | 4.00 | 8.566294e+06 | 
| 13 | Azerbaijan | 1.980000 | 6.200300e+10 | 956.709718 | 1.197249 | 46.157363 | 25.25 | 9.531856e+06 | 
| 14 | Bahamas, The | 1.877250 | 8.688000e+09 | 1727.128385 | 3.308626 | NaN | 81.50 | 3.819036e+05 | 
| 15 | Bahrain | 2.065250 | 3.200401e+10 | 2030.158316 | 2.976386 | 49.116838 | 15.25 | 1.349810e+06 | 
| 16 | Bangladesh | 2.193250 | 1.745451e+11 | 85.968844 | 0.860447 | 50.460564 | 194.75 | 1.593712e+08 | 
| 17 | Barbados | 1.792250 | 4.413080e+09 | 1062.840088 | 4.828680 | 48.878181 | 28.00 | 2.833384e+05 | 
| 18 | Belarus | 1.677000 | 6.478294e+10 | 986.236757 | 3.876601 | 48.685741 | 4.00 | 9.480348e+06 | 
| 19 | Belgium | 1.755000 | 4.942218e+11 | 4297.838005 | 8.221003 | 48.864675 | 7.00 | 1.122850e+07 | 
| 20 | Belize | 2.594750 | 1.680325e+09 | 471.967465 | 3.744844 | 48.317238 | 29.25 | 3.517636e+05 | 
| 21 | Benin | 4.806750 | 8.778151e+09 | 83.726190 | 2.206916 | 47.211127 | 417.50 | 1.029371e+07 | 
| 22 | Bermuda | 1.617500 | 5.555624e+09 | NaN | NaN | 48.423588 | NaN | 6.510080e+04 | 
| 23 | Bhutan | 2.061250 | 1.975145e+09 | 277.526670 | 2.706908 | 49.572296 | 161.75 | 7.759054e+05 | 
| 24 | Bolivia | 2.995250 | 3.150932e+10 | 381.007594 | 4.192031 | 48.464175 | 218.25 | 1.056280e+07 | 
| 25 | Bosnia and Herzegovina | 1.267000 | 1.732333e+10 | 941.504655 | 6.841021 | 48.634905 | 11.75 | 3.574396e+06 | 
| 26 | Botswana | 2.845000 | 1.511339e+10 | 880.909202 | 3.552071 | 48.844009 | 138.75 | 2.169170e+06 | 
| 27 | Brazil | 1.795250 | 2.198766e+12 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 2.041595e+08 | 
| 28 | British Virgin Islands | NaN | NaN | NaN | NaN | 47.581520 | NaN | 2.958540e+04 | 
| 29 | Brunei Darussalam | 1.884000 | 1.571922e+10 | 1795.924160 | 2.335194 | 48.523699 | 23.75 | 4.115812e+05 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 233 | Syrian Arab Republic | 2.967750 | NaN | 269.945739 | 1.507166 | 48.047394 | 62.00 | 1.931967e+07 | 
| 234 | Tajikistan | 3.495750 | 8.036228e+09 | 169.745970 | 1.976367 | 48.260680 | 33.25 | 8.363844e+06 | 
| 235 | Tanzania | 5.181250 | 4.493554e+10 | 131.704162 | 2.648609 | 50.666580 | 429.50 | 5.228132e+07 | 
| 236 | Thailand | 1.516750 | 4.061369e+11 | 581.927487 | 3.183842 | 48.213034 | 21.00 | 6.838499e+07 | 
| 237 | Timor-Leste | 5.797750 | 1.361430e+09 | 98.577296 | 1.140440 | 48.337367 | 240.25 | 1.212718e+06 | 
| 238 | Togo | 4.620000 | 4.183610e+09 | 71.263825 | 2.037809 | 48.270471 | 380.75 | 7.230904e+06 | 
| 239 | Tonga | 3.745750 | 4.391789e+08 | 250.962504 | 3.987285 | 47.697931 | 129.25 | 1.059094e+05 | 
| 240 | Trinidad and Tobago | 1.782750 | 2.457095e+10 | 1778.148073 | 3.071370 | NaN | 63.25 | 1.353877e+06 | 
| 241 | Tunisia | 2.140000 | 4.482437e+10 | 782.950522 | 4.118771 | 48.142132 | 63.25 | 1.114441e+07 | 
| 242 | Turkey | 2.078000 | 8.951756e+11 | 997.374772 | 4.189521 | 48.789477 | 17.50 | 7.703435e+07 | 
| 243 | Turkmenistan | 2.313750 | 3.797310e+10 | 288.572644 | 1.349303 | 48.906879 | 43.50 | 5.465637e+06 | 
| 244 | Turks and Caicos Islands | NaN | NaN | NaN | NaN | 48.846884 | NaN | 3.370340e+04 | 
| 245 | Tuvalu | NaN | 3.646999e+07 | 563.500592 | 15.506929 | 47.472414 | NaN | 1.091000e+04 | 
| 246 | Uganda | 5.822500 | 2.594146e+10 | 132.892684 | 2.014349 | 50.099485 | 366.50 | 3.886534e+07 | 
| 247 | Ukraine | 1.510250 | 1.353793e+11 | 628.579254 | 3.960185 | 48.984198 | 24.25 | 4.530270e+07 | 
| 248 | United Arab Emirates | 1.793000 | 3.750271e+11 | 2202.407569 | 2.581168 | 48.789260 | 6.00 | 9.080299e+06 | 
| 249 | United Kingdom | 1.842500 | 2.768864e+12 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 6.464156e+07 | 
| 250 | United States | 1.860875 | 1.736912e+13 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 3.185582e+08 | 
| 251 | Upper middle income | 1.795244 | 2.097441e+13 | 870.897512 | 3.358153 | 47.112001 | 43.25 | 2.540966e+09 | 
| 252 | Uruguay | 2.027000 | 5.434513e+10 | 1721.507752 | 6.044403 | 48.295555 | 15.50 | 3.419977e+06 | 
| 253 | Uzbekistan | 2.372750 | 6.134065e+10 | 334.476754 | 3.118842 | 48.387434 | 37.00 | 3.078450e+07 | 
| 254 | Vanuatu | 3.364750 | 7.828760e+08 | 125.568712 | 3.689874 | 47.301617 | 82.50 | 2.588964e+05 | 
| 255 | Venezuela, RB | 2.378250 | 3.761463e+11 | 896.815314 | 1.587088 | 48.400934 | 97.00 | 3.073452e+07 | 
| 256 | Vietnam | 1.959500 | 1.818207e+11 | 368.374550 | 3.779501 | 48.021053 | 54.75 | 9.074240e+07 | 
| 257 | Virgin Islands (U.S.) | 1.760000 | 3.812000e+09 | NaN | NaN | NaN | NaN | 1.041414e+05 | 
| 258 | West Bank and Gaza | 4.208000 | 1.250822e+10 | NaN | NaN | 48.828520 | 47.50 | 4.296960e+06 | 
| 259 | World | 2.464282 | 7.613006e+13 | 1223.941243 | 5.947058 | 48.076575 | 223.75 | 7.269321e+09 | 
| 260 | Yemen, Rep. | 4.225750 | 3.681934e+10 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 2.624661e+07 | 
| 261 | Zambia | 5.394250 | 2.428099e+10 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 1.563322e+07 | 
| 262 | Zimbabwe | 3.943000 | 1.549551e+10 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 1.542096e+07 | 
263 rows × 8 columns
The data frame has rows and columns.  Like other Python objects, it has attributes.  These are pieces of data associated with the data frame.  You have already seen methods, which are functions associated with the data frame.  You can access attributes in the same way as you access methods, by typing the variable name, followed by a dot ., followed by the attribute name.
For example, one attribute of the data frame, is the shape:
gender_data.shape
(263, 8)
Another attribute is columns.  This has the column names. For
example, here is a good way of quickly seeing the column names
for a data frame:
gender_data.columns
Index(['country', 'fert_rate', 'gdp', 'health_exp_per_cap', 'health_exp_pub',
       'prim_ed_girls', 'mat_mort_ratio', 'population'],
      dtype='object')
You need more information about what these column names refer to. Here are the longer descriptions from the original data source (link above):
fert_rate: Fertility rate, total (births per woman).gdp: GDP (current US$).health_exp_per_cap: Health expenditure per capita, PPP (constant 2011 international \$).health_exp_pub: Health expenditure, public (% of GDP).prim_ed_girls: Primary education, pupils (% female).mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).population: Population, total.
You have just seen array slicing (in Selecting with arrays. You remember that array slicing uses square brackets. Data frames also allow slicing. For example, we often want to get all the data for a single column of the data frame. To do this, we use the same square bracket notation as we use for array slicing, with the name of the column inside the square brackets.
gdp = gender_data['gdp']
What type of thing is this column of data?
type(gdp)
pandas.core.series.Series
Here are the values for gdp.  You will notice that these are
the same values you saw in the “gdp” column when you displayed
the whole data frame.
gdp
0      1.996102e+10
1      1.232759e+10
2      1.907346e+11
3      6.405000e+08
4      3.197538e+09
5      1.119365e+11
6      1.298213e+09
7      2.709059e+12
8      5.509810e+11
9      1.088536e+10
10              NaN
11     1.422994e+12
12     4.074943e+11
13     6.200300e+10
14     8.688000e+09
15     3.200401e+10
16     1.745451e+11
17     4.413080e+09
18     6.478294e+10
19     4.942218e+11
20     1.680325e+09
21     8.778151e+09
22     5.555624e+09
23     1.975145e+09
24     3.150932e+10
25     1.732333e+10
26     1.511339e+10
27     2.198766e+12
28              NaN
29     1.571922e+10
           ...     
233             NaN
234    8.036228e+09
235    4.493554e+10
236    4.061369e+11
237    1.361430e+09
238    4.183610e+09
239    4.391789e+08
240    2.457095e+10
241    4.482437e+10
242    8.951756e+11
243    3.797310e+10
244             NaN
245    3.646999e+07
246    2.594146e+10
247    1.353793e+11
248    3.750271e+11
249    2.768864e+12
250    1.736912e+13
251    2.097441e+13
252    5.434513e+10
253    6.134065e+10
254    7.828760e+08
255    3.761463e+11
256    1.818207e+11
257    3.812000e+09
258    1.250822e+10
259    7.613006e+13
260    3.681934e+10
261    2.428099e+10
262    1.549551e+10
Name: gdp, Length: 263, dtype: float64
What are these values like 6.405000e+08?  These are numbers,
in scientific
notation.
Scientific notation is a compact way of writing very large or
very small numbers.  The value after e above is the
exponent, in this case 08.  The number above means $6.405
- 10^8$. For example, here is $2 * 10^7$:
 
2e7
20000000.0
Missing values and NaN
Looking at the values of gdp (and therefore, the values of the
gdp column of gender_data, we see that some of the values
are NaN, which means Not a Number.  Pandas uses this marker to
indicate values that are not available, or missing data.
Numpy does not like to calculate with NaN values.  Here is Numpy trying to calculate the median of the gdp values.
np.median(gdp)
nan
Notice the warning about an invalid value.
Numpy recognizes that one or more values are NaN and refuses to guess what to do, when calculating the median.
You saw from the shape above that gender_data has 263 rows.  We can use the general Python len function, to see how many elements there are in gdp.
len(gdp)
263
As expected, it has the same number of elements as there are rows in gender_data.
The count method of the series gives the number of values that are not missing - that is - not NaN.
gdp.count()
246
Plotting with methods
The gdp variable is a sequence of values, so we can do a histogram on these values, as we have done histograms on arrays.
plt.hist(gdp);

Notice the multiple warnings as Matplotlib tried to calculate the bin widths for the histogram.  These are from the NaN values.
Another way to do the histogram, is to use the hist method of the series.
A method is a function attached to a value.  In this case hist is a function attached to a value of type Series.
Using the hist method instead of the plt.hist function can make the code a bit easier to read.  The method also has the advantage that it discards the NaN values, by default, so it does not generate the same warnings.
gdp.hist();

Now we have had a look at the GDP values, we will look at the
values for the mat_mort_ratio column.  These are the numbers
of women who die in childbirth for every 100,000 births.
mmr = gender_data['mat_mort_ratio']
mmr
0      444.00
1       29.25
2      142.50
3         NaN
4         NaN
5      501.25
6         NaN
7      161.00
8       53.75
9       27.25
10        NaN
11       6.00
12       4.00
13      25.25
14      81.50
15      15.25
16     194.75
17      28.00
18       4.00
19       7.00
20      29.25
21     417.50
22        NaN
23     161.75
24     218.25
25      11.75
26     138.75
27      49.50
28        NaN
29      23.75
        ...  
233     62.00
234     33.25
235    429.50
236     21.00
237    240.25
238    380.75
239    129.25
240     63.25
241     63.25
242     17.50
243     43.50
244       NaN
245       NaN
246    366.50
247     24.25
248      6.00
249      9.25
250     14.00
251     43.25
252     15.50
253     37.00
254     82.50
255     97.00
256     54.75
257       NaN
258     47.50
259    223.75
260    399.75
261    233.75
262    398.00
Name: mat_mort_ratio, Length: 263, dtype: float64
mmr.hist();

We are interested in the relationship of gpp and mmr.  Maybe richer countries have better health care, and fewer maternal deaths.
Here is a plot, using the standard Matplotlib scatter
function.
plt.scatter(gdp, mmr);

We can do the same plot using the plot.scatter method on the data frame.  In that case, we specify the column names that should go on the x and the y axes.
gender_data.plot.scatter('gdp', 'mat_mort_ratio');

An advantage of doing it this way is that we get the column names on the x and y axes by default.
Showing the top 5 values with the head method
We have already seen that Pandas will display the data frame with nice formatting. If the data frame is long, it will display only the first few and the last few rows:
gender_data
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 4.954500 | 1.996102e+10 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 3.271584e+07 | 
| 1 | Albania | 1.769250 | 1.232759e+10 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280e+06 | 
| 2 | Algeria | 2.866000 | 1.907346e+11 | 870.766508 | 4.984252 | 47.675617 | 142.50 | 3.909906e+07 | 
| 3 | American Samoa | NaN | 6.405000e+08 | NaN | NaN | NaN | NaN | 5.542200e+04 | 
| 4 | Andorra | NaN | 3.197538e+09 | 4421.224933 | 7.260281 | 47.123345 | NaN | 7.954740e+04 | 
| 5 | Angola | 6.123000 | 1.119365e+11 | 254.747970 | 2.447546 | NaN | 501.25 | 2.693754e+07 | 
| 6 | Antigua and Barbuda | 2.082000 | 1.298213e+09 | 1152.493656 | 3.676514 | 48.291463 | NaN | 9.887240e+04 | 
| 7 | Arab World | 3.397587 | 2.709059e+12 | 761.401727 | 2.873840 | 47.119776 | 161.00 | 3.899620e+08 | 
| 8 | Argentina | 2.328000 | 5.509810e+11 | 1148.256142 | 2.782216 | 48.915810 | 53.75 | 4.297667e+07 | 
| 9 | Armenia | 1.545500 | 1.088536e+10 | 348.663884 | 1.916016 | 46.782180 | 27.25 | 2.904683e+06 | 
| 10 | Aruba | 1.663250 | NaN | NaN | NaN | 48.721939 | NaN | 1.037444e+05 | 
| 11 | Australia | 1.861500 | 1.422994e+12 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 2.344456e+07 | 
| 12 | Austria | 1.455000 | 4.074943e+11 | 4930.298893 | 8.504276 | 48.556078 | 4.00 | 8.566294e+06 | 
| 13 | Azerbaijan | 1.980000 | 6.200300e+10 | 956.709718 | 1.197249 | 46.157363 | 25.25 | 9.531856e+06 | 
| 14 | Bahamas, The | 1.877250 | 8.688000e+09 | 1727.128385 | 3.308626 | NaN | 81.50 | 3.819036e+05 | 
| 15 | Bahrain | 2.065250 | 3.200401e+10 | 2030.158316 | 2.976386 | 49.116838 | 15.25 | 1.349810e+06 | 
| 16 | Bangladesh | 2.193250 | 1.745451e+11 | 85.968844 | 0.860447 | 50.460564 | 194.75 | 1.593712e+08 | 
| 17 | Barbados | 1.792250 | 4.413080e+09 | 1062.840088 | 4.828680 | 48.878181 | 28.00 | 2.833384e+05 | 
| 18 | Belarus | 1.677000 | 6.478294e+10 | 986.236757 | 3.876601 | 48.685741 | 4.00 | 9.480348e+06 | 
| 19 | Belgium | 1.755000 | 4.942218e+11 | 4297.838005 | 8.221003 | 48.864675 | 7.00 | 1.122850e+07 | 
| 20 | Belize | 2.594750 | 1.680325e+09 | 471.967465 | 3.744844 | 48.317238 | 29.25 | 3.517636e+05 | 
| 21 | Benin | 4.806750 | 8.778151e+09 | 83.726190 | 2.206916 | 47.211127 | 417.50 | 1.029371e+07 | 
| 22 | Bermuda | 1.617500 | 5.555624e+09 | NaN | NaN | 48.423588 | NaN | 6.510080e+04 | 
| 23 | Bhutan | 2.061250 | 1.975145e+09 | 277.526670 | 2.706908 | 49.572296 | 161.75 | 7.759054e+05 | 
| 24 | Bolivia | 2.995250 | 3.150932e+10 | 381.007594 | 4.192031 | 48.464175 | 218.25 | 1.056280e+07 | 
| 25 | Bosnia and Herzegovina | 1.267000 | 1.732333e+10 | 941.504655 | 6.841021 | 48.634905 | 11.75 | 3.574396e+06 | 
| 26 | Botswana | 2.845000 | 1.511339e+10 | 880.909202 | 3.552071 | 48.844009 | 138.75 | 2.169170e+06 | 
| 27 | Brazil | 1.795250 | 2.198766e+12 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 2.041595e+08 | 
| 28 | British Virgin Islands | NaN | NaN | NaN | NaN | 47.581520 | NaN | 2.958540e+04 | 
| 29 | Brunei Darussalam | 1.884000 | 1.571922e+10 | 1795.924160 | 2.335194 | 48.523699 | 23.75 | 4.115812e+05 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 233 | Syrian Arab Republic | 2.967750 | NaN | 269.945739 | 1.507166 | 48.047394 | 62.00 | 1.931967e+07 | 
| 234 | Tajikistan | 3.495750 | 8.036228e+09 | 169.745970 | 1.976367 | 48.260680 | 33.25 | 8.363844e+06 | 
| 235 | Tanzania | 5.181250 | 4.493554e+10 | 131.704162 | 2.648609 | 50.666580 | 429.50 | 5.228132e+07 | 
| 236 | Thailand | 1.516750 | 4.061369e+11 | 581.927487 | 3.183842 | 48.213034 | 21.00 | 6.838499e+07 | 
| 237 | Timor-Leste | 5.797750 | 1.361430e+09 | 98.577296 | 1.140440 | 48.337367 | 240.25 | 1.212718e+06 | 
| 238 | Togo | 4.620000 | 4.183610e+09 | 71.263825 | 2.037809 | 48.270471 | 380.75 | 7.230904e+06 | 
| 239 | Tonga | 3.745750 | 4.391789e+08 | 250.962504 | 3.987285 | 47.697931 | 129.25 | 1.059094e+05 | 
| 240 | Trinidad and Tobago | 1.782750 | 2.457095e+10 | 1778.148073 | 3.071370 | NaN | 63.25 | 1.353877e+06 | 
| 241 | Tunisia | 2.140000 | 4.482437e+10 | 782.950522 | 4.118771 | 48.142132 | 63.25 | 1.114441e+07 | 
| 242 | Turkey | 2.078000 | 8.951756e+11 | 997.374772 | 4.189521 | 48.789477 | 17.50 | 7.703435e+07 | 
| 243 | Turkmenistan | 2.313750 | 3.797310e+10 | 288.572644 | 1.349303 | 48.906879 | 43.50 | 5.465637e+06 | 
| 244 | Turks and Caicos Islands | NaN | NaN | NaN | NaN | 48.846884 | NaN | 3.370340e+04 | 
| 245 | Tuvalu | NaN | 3.646999e+07 | 563.500592 | 15.506929 | 47.472414 | NaN | 1.091000e+04 | 
| 246 | Uganda | 5.822500 | 2.594146e+10 | 132.892684 | 2.014349 | 50.099485 | 366.50 | 3.886534e+07 | 
| 247 | Ukraine | 1.510250 | 1.353793e+11 | 628.579254 | 3.960185 | 48.984198 | 24.25 | 4.530270e+07 | 
| 248 | United Arab Emirates | 1.793000 | 3.750271e+11 | 2202.407569 | 2.581168 | 48.789260 | 6.00 | 9.080299e+06 | 
| 249 | United Kingdom | 1.842500 | 2.768864e+12 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 6.464156e+07 | 
| 250 | United States | 1.860875 | 1.736912e+13 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 3.185582e+08 | 
| 251 | Upper middle income | 1.795244 | 2.097441e+13 | 870.897512 | 3.358153 | 47.112001 | 43.25 | 2.540966e+09 | 
| 252 | Uruguay | 2.027000 | 5.434513e+10 | 1721.507752 | 6.044403 | 48.295555 | 15.50 | 3.419977e+06 | 
| 253 | Uzbekistan | 2.372750 | 6.134065e+10 | 334.476754 | 3.118842 | 48.387434 | 37.00 | 3.078450e+07 | 
| 254 | Vanuatu | 3.364750 | 7.828760e+08 | 125.568712 | 3.689874 | 47.301617 | 82.50 | 2.588964e+05 | 
| 255 | Venezuela, RB | 2.378250 | 3.761463e+11 | 896.815314 | 1.587088 | 48.400934 | 97.00 | 3.073452e+07 | 
| 256 | Vietnam | 1.959500 | 1.818207e+11 | 368.374550 | 3.779501 | 48.021053 | 54.75 | 9.074240e+07 | 
| 257 | Virgin Islands (U.S.) | 1.760000 | 3.812000e+09 | NaN | NaN | NaN | NaN | 1.041414e+05 | 
| 258 | West Bank and Gaza | 4.208000 | 1.250822e+10 | NaN | NaN | 48.828520 | 47.50 | 4.296960e+06 | 
| 259 | World | 2.464282 | 7.613006e+13 | 1223.941243 | 5.947058 | 48.076575 | 223.75 | 7.269321e+09 | 
| 260 | Yemen, Rep. | 4.225750 | 3.681934e+10 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 2.624661e+07 | 
| 261 | Zambia | 5.394250 | 2.428099e+10 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 1.563322e+07 | 
| 262 | Zimbabwe | 3.943000 | 1.549551e+10 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 1.542096e+07 | 
263 rows × 8 columns
Notice the ... in the center of this listing, to show that it has not printed some rows.
Sometimes we do not want to see all these rows, but only - say - the top five rows.   The head method of the data frame is a useful way to do this:
gender_data.head()
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 4.95450 | 1.996102e+10 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32715838.4 | 
| 1 | Albania | 1.76925 | 1.232759e+10 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2888280.2 | 
| 2 | Algeria | 2.86600 | 1.907346e+11 | 870.766508 | 4.984252 | 47.675617 | 142.50 | 39099060.4 | 
| 3 | American Samoa | NaN | 6.405000e+08 | NaN | NaN | NaN | NaN | 55422.0 | 
| 4 | Andorra | NaN | 3.197538e+09 | 4421.224933 | 7.260281 | 47.123345 | NaN | 79547.4 | 
The Series also has a head method, that does the same thing:
gdp.head()
0    1.996102e+10
1    1.232759e+10
2    1.907346e+11
3    6.405000e+08
4    3.197538e+09
Name: gdp, dtype: float64
Selecting rows
We often want to select rows from the data frame that match some criterion.
Say we want to select the rows corresponding the countries with a high GDP.
Looking at the histogram of gdp above, we could try this as a threshold to identify high GDP countries.
high_gdp = gdp > 1e13
high_gdp
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
233    False
234    False
235    False
236    False
237    False
238    False
239    False
240    False
241    False
242    False
243    False
244    False
245    False
246    False
247    False
248    False
249    False
250     True
251     True
252    False
253    False
254    False
255    False
256    False
257    False
258    False
259     True
260    False
261    False
262    False
Name: gdp, Length: 263, dtype: bool
type(high_gdp)
pandas.core.series.Series
Notice that high_gdp is a Boolean series, like the Boolean arrays you have already seen.  It has True for elements corresponding to countries with gdp value greater than 1e13 and False otherwise.
We can use this Boolean series to select rows from the data frame.  The loc attribute of the data frame allows us to LOCate values in the data frame.  For our Boolean series, it works like this:
rich_gender_data = gender_data.loc[high_gdp]
rich_gender_data
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 44 | China | 1.558750 | 1.018279e+13 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1.364446e+09 | 
| 60 | Early-demographic dividend | 2.636376 | 1.019283e+13 | 392.428268 | 2.595967 | 48.651143 | 169.00 | 3.083697e+09 | 
| 61 | East Asia & Pacific | 1.781424 | 2.168128e+13 | 835.974259 | 4.687596 | 47.212490 | 63.00 | 2.265974e+09 | 
| 62 | East Asia & Pacific (IDA & IBRD) | 1.811850 | 1.239991e+13 | 558.711100 | 2.815573 | 47.098031 | 66.75 | 1.996942e+09 | 
| 63 | East Asia & Pacific (excluding high income) | 1.813950 | 1.242383e+13 | 558.702327 | 2.815498 | 47.115173 | 66.75 | 2.022090e+09 | 
| 71 | Euro area | 1.551004 | 1.255692e+13 | 3913.466364 | 7.956080 | 48.610030 | 6.50 | 3.384615e+08 | 
| 72 | Europe & Central Asia | 1.738094 | 2.191519e+13 | 2518.566323 | 7.130694 | 48.653599 | 16.75 | 9.032073e+08 | 
| 75 | European Union | 1.570012 | 1.731910e+13 | 3448.910224 | 7.816628 | 48.658777 | 8.00 | 5.082110e+08 | 
| 98 | High income | 1.686585 | 4.884635e+13 | 5045.885008 | 7.602022 | 48.701030 | 10.00 | 1.175934e+09 | 
| 102 | IBRD only | 2.103185 | 2.607726e+13 | 625.357428 | 3.104682 | 48.026892 | 106.25 | 4.607548e+09 | 
| 103 | IDA & IBRD total | 2.587557 | 2.803020e+13 | 507.044258 | 3.013687 | 47.896337 | 244.75 | 6.113147e+09 | 
| 128 | Late-demographic dividend | 1.667820 | 1.862014e+13 | 838.259199 | 3.335746 | 47.046383 | 36.00 | 2.235352e+09 | 
| 140 | Low & middle income | 2.595258 | 2.724634e+13 | 498.193061 | 2.982075 | NaN | 245.25 | 6.089148e+09 | 
| 160 | Middle income | 2.361746 | 2.691014e+13 | 542.340940 | 2.996480 | 48.044140 | 185.75 | 5.468296e+09 | 
| 177 | North America | 1.834404 | 1.908336e+13 | 8615.535450 | 8.066182 | 48.708683 | 13.50 | 3.541404e+08 | 
| 180 | OECD members | 1.749418 | 4.787743e+13 | 4566.959377 | 7.636198 | 48.704364 | 15.00 | 1.273149e+09 | 
| 193 | Post-demographic dividend | 1.636470 | 4.541806e+13 | 5124.214162 | 7.858170 | 48.649863 | 10.75 | 1.092637e+09 | 
| 250 | United States | 1.860875 | 1.736912e+13 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 3.185582e+08 | 
| 251 | Upper middle income | 1.795244 | 2.097441e+13 | 870.897512 | 3.358153 | 47.112001 | 43.25 | 2.540966e+09 | 
| 259 | World | 2.464282 | 7.613006e+13 | 1223.941243 | 5.947058 | 48.076575 | 223.75 | 7.269321e+09 | 
type(rich_gender_data)
pandas.core.frame.DataFrame
rich_gender_data is a new data frame, that is a subset of the
original gender_data frame.  It contains only the rows where
the GDP value is greater than 1e13 dollars.  Check the display
of rich_gender_data above to confirm that the values in the
gdp column are all greater than 1e13.
We can do a scatter plot of GDP values against maternal mortality rate, and we find, oddly, that for rich countries, there is little relationship between GDP and maternal mortality.
rich_gender_data.plot.scatter('gdp', 'mat_mort_ratio')
<matplotlib.axes._subplots.AxesSubplot at 0x1178bb7b8>

Sorting data frames
Data frames have a method sort_value.  This returns a new data frame with the rows sorted by the values in the column we specify.
Here are the first five rows of the original data frame:
gender_data.head()
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 4.95450 | 1.996102e+10 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32715838.4 | 
| 1 | Albania | 1.76925 | 1.232759e+10 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2888280.2 | 
| 2 | Algeria | 2.86600 | 1.907346e+11 | 870.766508 | 4.984252 | 47.675617 | 142.50 | 39099060.4 | 
| 3 | American Samoa | NaN | 6.405000e+08 | NaN | NaN | NaN | NaN | 55422.0 | 
| 4 | Andorra | NaN | 3.197538e+09 | 4421.224933 | 7.260281 | 47.123345 | NaN | 79547.4 | 
We can make a new data frame where the rows are sorted by the values in the gdp column:
gender_data_by_gdp = gender_data.sort_values('gdp')
gender_data_by_gdp.head()
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 245 | Tuvalu | NaN | 3.646999e+07 | 563.500592 | 15.506929 | 47.472414 | NaN | 10910.0 | 
| 169 | Nauru | NaN | 1.063908e+08 | 611.020856 | 4.527932 | 49.409439 | NaN | 11695.4 | 
| 121 | Kiribati | 3.74525 | 1.774306e+08 | 179.372383 | 8.279184 | 49.557255 | 95.0 | 110481.6 | 
| 152 | Marshall Islands | NaN | 1.843189e+08 | 657.059335 | 14.302529 | 48.397282 | NaN | 52882.8 | 
| 185 | Palau | 2.22000 | 2.548400e+08 | 1389.993077 | 6.623460 | 46.425690 | NaN | 21112.6 | 
Notice that the rows are in ascending order of gdp.  You can imagine, that we often want descending order.  As usual you can explore how you might do this by looking at the help for the sort_values method with:
gender_data.sort_values?
in a new cell.   If you do that, you discover the ascending argument, that you can use like this:
gender_data.sort_values('gdp', ascending=False)
| country | fert_rate | gdp | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
|---|---|---|---|---|---|---|---|---|
| 259 | World | 2.464282 | 7.613006e+13 | 1223.941243 | 5.947058 | 48.076575 | 223.75 | 7.269321e+09 | 
| 98 | High income | 1.686585 | 4.884635e+13 | 5045.885008 | 7.602022 | 48.701030 | 10.00 | 1.175934e+09 | 
| 180 | OECD members | 1.749418 | 4.787743e+13 | 4566.959377 | 7.636198 | 48.704364 | 15.00 | 1.273149e+09 | 
| 193 | Post-demographic dividend | 1.636470 | 4.541806e+13 | 5124.214162 | 7.858170 | 48.649863 | 10.75 | 1.092637e+09 | 
| 103 | IDA & IBRD total | 2.587557 | 2.803020e+13 | 507.044258 | 3.013687 | 47.896337 | 244.75 | 6.113147e+09 | 
| 140 | Low & middle income | 2.595258 | 2.724634e+13 | 498.193061 | 2.982075 | NaN | 245.25 | 6.089148e+09 | 
| 160 | Middle income | 2.361746 | 2.691014e+13 | 542.340940 | 2.996480 | 48.044140 | 185.75 | 5.468296e+09 | 
| 102 | IBRD only | 2.103185 | 2.607726e+13 | 625.357428 | 3.104682 | 48.026892 | 106.25 | 4.607548e+09 | 
| 72 | Europe & Central Asia | 1.738094 | 2.191519e+13 | 2518.566323 | 7.130694 | 48.653599 | 16.75 | 9.032073e+08 | 
| 61 | East Asia & Pacific | 1.781424 | 2.168128e+13 | 835.974259 | 4.687596 | 47.212490 | 63.00 | 2.265974e+09 | 
| 251 | Upper middle income | 1.795244 | 2.097441e+13 | 870.897512 | 3.358153 | 47.112001 | 43.25 | 2.540966e+09 | 
| 177 | North America | 1.834404 | 1.908336e+13 | 8615.535450 | 8.066182 | 48.708683 | 13.50 | 3.541404e+08 | 
| 128 | Late-demographic dividend | 1.667820 | 1.862014e+13 | 838.259199 | 3.335746 | 47.046383 | 36.00 | 2.235352e+09 | 
| 250 | United States | 1.860875 | 1.736912e+13 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 3.185582e+08 | 
| 75 | European Union | 1.570012 | 1.731910e+13 | 3448.910224 | 7.816628 | 48.658777 | 8.00 | 5.082110e+08 | 
| 71 | Euro area | 1.551004 | 1.255692e+13 | 3913.466364 | 7.956080 | 48.610030 | 6.50 | 3.384615e+08 | 
| 63 | East Asia & Pacific (excluding high income) | 1.813950 | 1.242383e+13 | 558.702327 | 2.815498 | 47.115173 | 66.75 | 2.022090e+09 | 
| 62 | East Asia & Pacific (IDA & IBRD) | 1.811850 | 1.239991e+13 | 558.711100 | 2.815573 | 47.098031 | 66.75 | 1.996942e+09 | 
| 60 | Early-demographic dividend | 2.636376 | 1.019283e+13 | 392.428268 | 2.595967 | 48.651143 | 169.00 | 3.083697e+09 | 
| 44 | China | 1.558750 | 1.018279e+13 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1.364446e+09 | 
| 142 | Lower middle income | 2.877481 | 5.929275e+12 | 254.618642 | 1.657578 | 48.632937 | 262.50 | 2.927329e+09 | 
| 129 | Latin America & Caribbean | 2.125164 | 5.845138e+12 | 1069.141865 | 3.659621 | 48.378604 | 70.75 | 6.242184e+08 | 
| 130 | Latin America & Caribbean (IDA & IBRD) | 2.138122 | 5.636800e+12 | 1050.010897 | 3.585924 | 48.382980 | 71.50 | 6.080342e+08 | 
| 131 | Latin America & Caribbean (excluding high income) | 2.140742 | 5.377437e+12 | 1046.116268 | 3.638974 | 48.379046 | 72.75 | 5.969302e+08 | 
| 117 | Japan | 1.430000 | 5.106025e+12 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 1.272971e+08 | 
| 73 | Europe & Central Asia (IDA & IBRD) | 1.861440 | 4.213635e+12 | 1173.135713 | 3.878399 | 48.632033 | 24.00 | 4.505588e+08 | 
| 74 | Europe & Central Asia (excluding high income) | 1.909466 | 3.710324e+12 | 1138.169165 | 3.794867 | 48.619737 | 25.50 | 4.125489e+08 | 
| 85 | Germany | 1.450000 | 3.601226e+12 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 8.128164e+07 | 
| 157 | Middle East & North Africa | 2.852531 | 3.380647e+12 | 912.491552 | 3.055660 | 47.621273 | 83.50 | 4.208876e+08 | 
| 249 | United Kingdom | 1.842500 | 2.768864e+12 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 6.464156e+07 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 254 | Vanuatu | 3.364750 | 7.828760e+08 | 125.568712 | 3.689874 | 47.301617 | 82.50 | 2.588964e+05 | 
| 224 | St. Vincent and the Grenadines | 1.986000 | 7.301068e+08 | 775.803386 | 4.365757 | 48.536415 | 45.75 | 1.094206e+05 | 
| 3 | American Samoa | NaN | 6.405000e+08 | NaN | NaN | NaN | NaN | 5.542200e+04 | 
| 46 | Comoros | 4.524000 | 6.039190e+08 | 99.221938 | 2.289039 | 47.515774 | 349.50 | 7.595556e+05 | 
| 58 | Dominica | NaN | 5.130350e+08 | 572.413734 | 3.749851 | 48.774225 | NaN | 7.278540e+04 | 
| 239 | Tonga | 3.745750 | 4.391789e+08 | 250.962504 | 3.987285 | 47.697931 | 129.25 | 1.059094e+05 | 
| 156 | Micronesia, Fed. Sts. | 3.269250 | 3.193208e+08 | 456.403851 | 12.037277 | 48.084030 | 103.25 | 1.041180e+05 | 
| 202 | Sao Tome and Principe | 4.603750 | 3.145400e+08 | 304.237533 | 3.227125 | 48.664169 | 159.50 | 1.913326e+05 | 
| 185 | Palau | 2.220000 | 2.548400e+08 | 1389.993077 | 6.623460 | 46.425690 | NaN | 2.111260e+04 | 
| 152 | Marshall Islands | NaN | 1.843189e+08 | 657.059335 | 14.302529 | 48.397282 | NaN | 5.288280e+04 | 
| 121 | Kiribati | 3.745250 | 1.774306e+08 | 179.372383 | 8.279184 | 49.557255 | 95.00 | 1.104816e+05 | 
| 169 | Nauru | NaN | 1.063908e+08 | 611.020856 | 4.527932 | 49.409439 | NaN | 1.169540e+04 | 
| 245 | Tuvalu | NaN | 3.646999e+07 | 563.500592 | 15.506929 | 47.472414 | NaN | 1.091000e+04 | 
| 10 | Aruba | 1.663250 | NaN | NaN | NaN | 48.721939 | NaN | 1.037444e+05 | 
| 28 | British Virgin Islands | NaN | NaN | NaN | NaN | 47.581520 | NaN | 2.958540e+04 | 
| 38 | Cayman Islands | NaN | NaN | NaN | NaN | 49.866695 | NaN | 5.915880e+04 | 
| 42 | Channel Islands | 1.462250 | NaN | NaN | NaN | NaN | NaN | 1.629612e+05 | 
| 53 | Curacao | 2.050000 | NaN | NaN | NaN | 47.986740 | NaN | 1.559594e+05 | 
| 68 | Eritrea | 4.324500 | NaN | 47.155226 | 1.429902 | 46.331369 | 524.50 | NaN | 
| 81 | French Polynesia | 2.051000 | NaN | NaN | NaN | NaN | NaN | 2.757226e+05 | 
| 87 | Gibraltar | NaN | NaN | NaN | NaN | NaN | NaN | 3.402560e+04 | 
| 122 | Korea, Dem. People's Rep. | 1.982500 | NaN | NaN | NaN | 49.030281 | 86.25 | 2.511378e+07 | 
| 137 | Libya | 2.485750 | NaN | 955.388861 | 3.231360 | NaN | 9.00 | 6.225309e+06 | 
| 162 | Monaco | NaN | NaN | 6421.678715 | 3.720085 | 49.637143 | NaN | 3.813840e+04 | 
| 172 | New Caledonia | 2.250000 | NaN | NaN | NaN | NaN | NaN | 2.682000e+05 | 
| 201 | San Marino | 1.260000 | NaN | 3437.298747 | 5.752693 | 45.616261 | NaN | 3.260740e+04 | 
| 209 | Sint Maarten (Dutch part) | NaN | NaN | NaN | NaN | 49.508551 | NaN | 3.755220e+04 | 
| 223 | St. Martin (French part) | 1.812500 | NaN | NaN | NaN | NaN | NaN | 3.149120e+04 | 
| 233 | Syrian Arab Republic | 2.967750 | NaN | 269.945739 | 1.507166 | 48.047394 | 62.00 | 1.931967e+07 | 
| 244 | Turks and Caicos Islands | NaN | NaN | NaN | NaN | 48.846884 | NaN | 3.370340e+04 | 
263 rows × 8 columns
As you might have guessed by now, Series also have a sort_values method.  For a Series, you don’t have to specify the column to sort from, because you are using the Series values.
gdp.sort_values(ascending=False)
259    7.613006e+13
98     4.884635e+13
180    4.787743e+13
193    4.541806e+13
103    2.803020e+13
140    2.724634e+13
160    2.691014e+13
102    2.607726e+13
72     2.191519e+13
61     2.168128e+13
251    2.097441e+13
177    1.908336e+13
128    1.862014e+13
250    1.736912e+13
75     1.731910e+13
71     1.255692e+13
63     1.242383e+13
62     1.239991e+13
60     1.019283e+13
44     1.018279e+13
142    5.929275e+12
129    5.845138e+12
130    5.636800e+12
131    5.377437e+12
117    5.106025e+12
73     4.213635e+12
74     3.710324e+12
85     3.601226e+12
157    3.380647e+12
249    2.768864e+12
           ...     
254    7.828760e+08
224    7.301068e+08
3      6.405000e+08
46     6.039190e+08
58     5.130350e+08
239    4.391789e+08
156    3.193208e+08
202    3.145400e+08
185    2.548400e+08
152    1.843189e+08
121    1.774306e+08
169    1.063908e+08
245    3.646999e+07
10              NaN
28              NaN
38              NaN
42              NaN
53              NaN
68              NaN
81              NaN
87              NaN
122             NaN
137             NaN
162             NaN
172             NaN
201             NaN
209             NaN
223             NaN
233             NaN
244             NaN
Name: gdp, Length: 263, dtype: float64