Booleans and labels¶
We have already used Boolean series to index data frames and other series.
This page gives a little more details about how that works, in order to explain some subtleties in results.
We return to the World Bank statistics on development and gender in
gender_stats.csv
.
Download that file to the same directory as this notebook, if you are running on your own computer.
Here we pull out the rows for the countries with the highest Gross Domestic Product. This may be familiar from the data frame intro and the Pandas indexing page.
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')
# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 5 rows.
richest_5 = df_by_gdp.head(5)
richest_5
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
35 | China | CHN | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
Consider the index (row labels) of this 5-row data frame:
richest_5.index
Int64Index([202, 35, 97, 49, 67], dtype='int64')
Now let us say we want to select some of these rows with Boolean indexing. Here
is a Boolean series with True for rows where health_exp_per_cap
is greater
than 3500 dollars, False otherwise.
# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_5['health_exp_per_cap'] > 3500
is_big_spender
202 True
35 False
97 True
49 True
67 False
Name: health_exp_per_cap, dtype: bool
Notice that this Series, like all Series, has an index, which is the same as
the index from the Series we used to make it
(richest_5['health_exp_per_cap']
), and therefore, the same as the index for
the data frame (richest_5
).
is_big_spender.index
Int64Index([202, 35, 97, 49, 67], dtype='int64')
Let’s say we are interested in the country_name
for countries spending more
than 3500 dollars per capita on health care. We can do this with, for example,
loc
:
richest_5.loc[is_big_spender, 'country_name']
202 United States
97 Japan
49 Germany
Name: country_name, dtype: object
loc
gives us the rows for which corresponding elements in is_big_spender
are True.
But what does corresponding mean? Does it mean corresponding in terms of position? Or in terms of index (row / element labels)?
Here’s the Boolean series again:
is_big_spender
202 True
35 False
97 True
49 True
67 False
Name: health_exp_per_cap, dtype: bool
If corresponding means corresponding in terms of position then we would
take the first row of richest_5
(because the first element in
is_big_spender
is True) but not the second row (because the second element in
is_big_spender
is False) and so on.
If corresponding means corresponding in terms of labels then we would take
the row from richest_5
with label 202 (because the element labeled 202 in
is_big_spender
is True), but we do not take the row labeled 35 (because the
element labeled 35 in is_big_spender
is False), and so on.
We cannot distinguish between these alternatives at the moment, because the
order of the labels is the same in the data frame richest_5
and the series
is_big_spender
, so positions and labels will give the same answer.
We can distinguish when the order of the labels is not the same in the Boolean series and the data frame. For example, we can sort the series, like this:
sorted_is_big_spender = is_big_spender.sort_values()
sorted_is_big_spender
35 False
67 False
202 True
97 True
49 True
Name: health_exp_per_cap, dtype: bool
Now the order of the row labels is different in the series (above) and the data frame (below):
richest_5.index
Int64Index([202, 35, 97, 49, 67], dtype='int64')
If we index with this sorted series, and indexing uses the position of the True and False values, we would expect to see the last three rows, because the values in the last three positions are True (and the others are False).
If indexing uses labels, then we expect to get the same answer we got before, because the relationship between the True / False values and labels hasn’t changed - the elements labeled 202, 97 and 49 have True values, so we expect to see those rows in the result. So (drum roll):
richest_5.loc[sorted_is_big_spender, 'country_name']
202 United States
97 Japan
49 Germany
Name: country_name, dtype: object
The result is the same as for the not-sorted series - and we have found that
loc
indexing uses the labels rather than the positions when indexing with
a Boolean series.
On reflection, maybe that is not surprising - after all, we already
know that loc
indexing does indexing by label.
We usually will not have to worry about the difference between labels and positions in Boolean Series, because, usually, the row labels and positions are the same in the thing we index (data frame, or Series) and the Boolean series we use for indexing. This was the case at the top of this page, and in all previous times you have seen Boolean indexing on data frames or series.
What about iloc?¶
iloc
does indexing by position - so what approach does iloc
take to the
labels on a Boolean series?
We guess that iloc
does not use the labels on a Boolean series, but what does
it do? Does it throw away the labels and just look at the positions of the
True and False values? Or something else?
# iloc indexing with a Boolean series.
# The 'country_name' column is the first column,
# so it is at offset 0 in the columns.
richest_5.iloc[sorted_is_big_spender, 0]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-11-8f5ae3f2285e> in <module>
2 # The 'country_name' column is the first column,
3 # so it is at offset 0 in the columns.
----> 4 richest_5.iloc[sorted_is_big_spender, 0]
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __getitem__(self, key)
887 # AttributeError for IntervalTree get_value
888 return self.obj._get_value(*key, takeable=self._takeable)
--> 889 return self._getitem_tuple(key)
890 else:
891 # we by definition only have the 0th axis
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1448 def _getitem_tuple(self, tup: Tuple):
1449
-> 1450 self._has_valid_tuple(tup)
1451 with suppress(IndexingError):
1452 return self._getitem_lowerdim(tup)
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
721 for i, k in enumerate(key):
722 try:
--> 723 self._validate_key(k, i)
724 except ValueError as err:
725 raise ValueError(
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
1342 if hasattr(key, "index") and isinstance(key.index, Index):
1343 if key.index.inferred_type == "integer":
-> 1344 raise NotImplementedError(
1345 "iLocation based boolean "
1346 "indexing on an integer type "
NotImplementedError: iLocation based boolean indexing on an integer type is not available
It does something else - it gives an error, telling us that it will not use the index. This is so whatever the index order - even if the index and the positions give the same answer:
# iloc with the original Boolean series, where positions
# and labels match.
richest_5.iloc[is_big_spender, 0]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-12-2f61cb9f0042> in <module>
1 # iloc with the original Boolean series, where positions
2 # and labels match.
----> 3 richest_5.iloc[is_big_spender, 0]
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __getitem__(self, key)
887 # AttributeError for IntervalTree get_value
888 return self.obj._get_value(*key, takeable=self._takeable)
--> 889 return self._getitem_tuple(key)
890 else:
891 # we by definition only have the 0th axis
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1448 def _getitem_tuple(self, tup: Tuple):
1449
-> 1450 self._has_valid_tuple(tup)
1451 with suppress(IndexingError):
1452 return self._getitem_lowerdim(tup)
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
721 for i, k in enumerate(key):
722 try:
--> 723 self._validate_key(k, i)
724 except ValueError as err:
725 raise ValueError(
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
1342 if hasattr(key, "index") and isinstance(key.index, Index):
1343 if key.index.inferred_type == "integer":
-> 1344 raise NotImplementedError(
1345 "iLocation based boolean "
1346 "indexing on an integer type "
NotImplementedError: iLocation based boolean indexing on an integer type is not available
If you want to use Boolean indexing with iloc
, you can use a Boolean sequence
without row / element labels, such as a Numpy array. We can achieve that by
converting the Boolean Series to a Numpy array, and therefore, throwing away
the index (element labels):
sorted_is_big_spender_arr = np.array(sorted_is_big_spender)
sorted_is_big_spender_arr
array([False, False, True, True, True])
This Boolean sequence will work with iloc
, because it has no index (element
labels) to trip it up:
# You can use a Boolean array to index rows, with "iloc".
richest_5.iloc[sorted_is_big_spender_arr, 0]
97 Japan
49 Germany
67 United Kingdom
Name: country_name, dtype: object
Notice this does gives us the rows where the matching position has True.