Indexing in Pandas¶
Way back in the introduction to data frames, we covered indexing into data frames with Boolean vectors, such as Pandas Series or Numpy arrays, and indexing into data frames with column names.
We call this type of indexing simple indexing
This page covers more sophisticated indexing in Pandas - advanced indexing.
To understand advanced indexing, we first need to cover the index of data frames and Series. The index contains the row labels for a data frame or series.
We start with our usual imports:
# Load the Numpy library, rename to "np"
import numpy as np
# Load the Pandas data science library, rename to "pd"
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
The index¶
We return to our first data frame, containing statistics on development and gender from the World Bank.
This is the gender_stats.csv
file.
See the gender stats data
page for
some more detail.
Download that file to the same directory as this notebook, if you are running on your own computer.
# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')
# Show the first 10 rows
gender_data.head(10)
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
5 | United Arab Emirates | ARE | 1.79300 | 375.027082 | 2202.407569 | 2.581168 | 48.789260 | 6.00 | 9.080299 |
6 | Argentina | ARG | 2.32800 | 550.980968 | 1148.256142 | 2.782216 | 48.915810 | 53.75 | 42.976675 |
7 | Armenia | ARM | 1.54550 | 10.885362 | 348.663884 | 1.916016 | 46.782180 | 27.25 | 2.904683 |
8 | American Samoa | ASM | NaN | 0.640500 | NaN | NaN | NaN | NaN | 0.055422 |
9 | Antigua and Barbuda | ATG | 2.08200 | 1.298213 | 1152.493656 | 3.676514 | 48.291463 | NaN | 0.098872 |
As you know, the names at the top, in bold typeface, are the names of the
columns. We can see these names using the columns
attribute of the data
frame:
gender_data.columns
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
'mat_mort_ratio', 'population'],
dtype='object')
These can also be called the column labels.
We have not paid attention so far to the numbers at the left of the table display, also in bold typeface.
In this case, they are ascending integers, starting at 0, with value 0 for the first row, 1 for the second row, and so on.
These numbers are the index of the data frame. The index contains the row labels.
We can see the row labels / index with the index
attribute of the data frame.
gender_data.index
RangeIndex(start=0, stop=216, step=1)
This is saying we have a simple index, that is like a np.arange
, starting at
0, and continuing in steps of 1 up to, but not including 216.
So far, the index looks like row numbers, with 0 for the first row, 1 for the second row. We will soon see that the index is more than that.
For example, let us look at the last 15 rows. We can get these with the tail
method of the data frame.
last_15 = gender_data.tail(15)
last_15
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
201 | Uruguay | URY | 2.027000 | 54.345132 | 1721.507752 | 6.044403 | 48.295555 | 15.50 | 3.419977 |
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
203 | Uzbekistan | UZB | 2.372750 | 61.340649 | 334.476754 | 3.118842 | 48.387434 | 37.00 | 30.784500 |
204 | St. Vincent and the Grenadines | VCT | 1.986000 | 0.730107 | 775.803386 | 4.365757 | 48.536415 | 45.75 | 0.109421 |
205 | Venezuela, RB | VEN | 2.378250 | 376.146268 | 896.815314 | 1.587088 | 48.400934 | 97.00 | 30.734524 |
206 | British Virgin Islands | VGB | NaN | NaN | NaN | NaN | 47.581520 | NaN | 0.029585 |
207 | Virgin Islands (U.S.) | VIR | 1.760000 | 3.812000 | NaN | NaN | NaN | NaN | 0.104141 |
208 | Vietnam | VNM | 1.959500 | 181.820736 | 368.374550 | 3.779501 | 48.021053 | 54.75 | 90.742400 |
209 | Vanuatu | VUT | 3.364750 | 0.782876 | 125.568712 | 3.689874 | 47.301617 | 82.50 | 0.258896 |
210 | Samoa | WSM | 4.118250 | 0.799887 | 366.353096 | 5.697059 | 48.350049 | 54.75 | 0.192225 |
211 | Kosovo | XKX | 2.142500 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
212 | Yemen, Rep. | YEM | 4.225750 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 2.375250 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 5.394250 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 3.943000 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
Here is the index
of this new data frame.
last_15.index
RangeIndex(start=201, stop=216, step=1)
Notice that the first row of this new data frame no longer has the label 0 - it has label 201. These numbers are the labels from the original data frame, from which we selected the rows.
Notice too that the second row in this table is the USA, and it has the label \202.
Now imagine that I sort the original data frame by Gross Domestic Product, and select the top 15 rows. By doing this, I select the top 15 richest countries.
# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 15 rows.
richest_15 = df_by_gdp.head(15)
richest_15
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
35 | China | CHN | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
26 | Brazil | BRA | 1.795250 | 2198.765606 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 204.159544 |
88 | India | IND | 2.449250 | 2019.005411 | 241.572477 | 1.292666 | 49.497234 | 185.25 | 1293.742537 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
164 | Russian Federation | RUS | 1.724500 | 1822.691700 | 1755.506635 | 3.731354 | 48.968070 | 25.25 | 143.793504 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
104 | Korea, Rep. | KOR | 1.232000 | 1346.751162 | 2385.447251 | 3.915606 | 48.023388 | 12.00 | 50.727212 |
58 | Spain | ESP | 1.307500 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
124 | Mexico | MEX | 2.257000 | 1188.802780 | 1081.208948 | 3.225839 | 48.906296 | 40.00 | 124.203450 |
Now notice that the index values have followed their respective rows. The row for the USA is now first, but it has kept its original index value (row label) - 202. The index value identifies the row.
Here is the index for this small table:
richest_15.index
Int64Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')
Now let us say that I wanted to get the eleventh row of this table.
I would be tempted to do something like richest_15[10]
- but then - what
would Pandas give me? Does 10
mean the row label for “Australia”? Or does
it mean the eleventh row of the table, as it would for an array, and therefore
the row for “Canada”? That is, will the indexing using the labels (index
values)? Or will it use the positions (as for indexing in Numpy arrays)?
The answer is, it is ambiguous, and when things are ambiguous, Pandas has to guess. That’s a bad idea, because then we have to guess what Pandas has guessed, and it is starting to get really confusing.
For that reason, we careful Pandas users do not index rows with code like
richest_15[10]
. When we are not using Booleans to select rows, we always use
one of the two indexing attributes that Pandas provides. Each does a different
type of indexing.
The
loc
(label locate) attribute indexes using the labels (row index).The
iloc
(position locate) attribute indexes using the positions.
Here they are in action:
# Get the row with label 10
richest_15.loc[10]
country_name Australia
country_code AUS
fert_rate 1.8615
gdp_us_billion 1422.994116
health_exp_per_cap 4256.058988
health_exp_pub 6.292381
prim_ed_girls 48.576707
mat_mort_ratio 6.0
population 23.44456
Name: 10, dtype: object
# Get the row at position / offset 10
richest_15.iloc[10]
country_name Canada
country_code CAN
fert_rate 1.6003
gdp_us_billion 1708.473627
health_exp_per_cap 4616.539134
health_exp_pub 7.546247
prim_ed_girls 48.808926
mat_mort_ratio 7.25
population 35.517119
Name: 32, dtype: object
In more details, what is happening in the first cell above is that
richest_15.loc
gives us a special value (AKA object) that:
knows about the
richest_15
data frame;will give us rows and columns, if we use
[ something ]
on it.
So, read richest_15.loc[10]
as:
Get the
richest_15
value (the data frame);Get the
loc
attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at labels;Use this object to fetch the row labeled 10, by using indexing (
[10]
)
Likewise, read richest_15.iloc[10]
as:
Get the
richest_15
value (the data frame);Get the
iloc
attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at positions (offsets);Use this object to fetch the row at offset 10, by using indexing (
[10]
).
When you select rows and columns from a data frame, consider carefully what you want.
Do you want to select rows and columns by label (by index, in the rows).
Then you want .loc
.
Do you want to select rows and columns by position (offset). Then you want
.iloc
.
Loc and iloc indexing work on columns too¶
Consider this. We use .loc
and the label for the row, followed by the
label for the column.
richest_15.loc[10, 'gdp_us_billion']
1422.994116449394
This selects the row labeled 10 (the USA row) and returns the value from the
gdp_us_billion
column.
Conversely, iloc
only understands positions, and so, integers. It does not
understand or use labels:
# We can't specify columns by label, using iloc.
richest_15.iloc[10, 'gdp_us_billion']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
722 try:
--> 723 self._validate_key(k, i)
724 except ValueError as err:
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
1374 else:
-> 1375 raise ValueError(f"Can only index by location with a [{self._valid_types}]")
1376
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-12-3b068730429a> in <module>
1 # We can't specify columns by label, using iloc.
----> 2 richest_15.iloc[10, 'gdp_us_billion']
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __getitem__(self, key)
887 # AttributeError for IntervalTree get_value
888 return self.obj._get_value(*key, takeable=self._takeable)
--> 889 return self._getitem_tuple(key)
890 else:
891 # we by definition only have the 0th axis
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1448 def _getitem_tuple(self, tup: Tuple):
1449
-> 1450 self._has_valid_tuple(tup)
1451 with suppress(IndexingError):
1452 return self._getitem_lowerdim(tup)
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
723 self._validate_key(k, i)
724 except ValueError as err:
--> 725 raise ValueError(
726 "Location based indexing can only have "
727 f"[{self._valid_types}] types"
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Notice the somewhat informative error right at the bottom of the error message.
iloc
can get columns by their position. For example, fertility_rate
is the
third column, and so we can get the fertility rate by asking iloc
for the
eleventh row, third column, like so:
richest_15.iloc[10, 2]
1.6002999999999998
This is the value from the row at offset 10 and the column at offset 2.
Series have an index too¶
Data frames have columns (and column labels), and rows (with row labels — the index).
Series also have element labels — they also have an index.
Consider this column, of the fertility rates of the richest countries:
# Get the fert_rate column from the data frame
rich_ferts = richest_15['fert_rate']
rich_ferts.head()
202 1.860875
35 1.558750
97 1.430000
49 1.450000
67 1.842500
Name: fert_rate, dtype: float64
As y’all know, this is a Series:
type(rich_ferts)
pandas.core.series.Series
You probably also spotted the numbers to the left of the values. Yes, the row labels have come across from the data frame into the Series. We can see that the first element has row label 202, and we remember that this is the row label (index value) for the USA.
As for the data frames, here is the index for the series:
rich_ferts.index
Int64Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')
Like data frames, we can index by label, with loc
, and by position, with
iloc
:
# The value for row labeled 10 - Australia.
rich_ferts.loc[10]
1.8615
# The value for the row at position 10 - Canada
rich_ferts.iloc[10]
1.6002999999999998
Rows from data frames are Series¶
Here we select the row at position 0 (the first row):
first_row = richest_15.iloc[0]
first_row
country_name United States
country_code USA
fert_rate 1.860875
gdp_us_billion 17369.1246
health_exp_per_cap 9060.068657
health_exp_pub 8.121961
prim_ed_girls 48.75883
mat_mort_ratio 14.0
population 318.558175
Name: 202, dtype: object
This is a Series:
type(first_row)
pandas.core.series.Series
Notice that the Series has 9 elements, one for each column in the data frame.
As you remember, we can see the number of rows and columns in a data frame with:
richest_15.shape
(15, 9)
Remember, the columns have names, which we can also call labels.
When we select a row out of a data frame, we have one element per column, and the labels for the columns now become the labels for the rows.
first_row.index
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
'mat_mort_ratio', 'population'],
dtype='object')
As for any other Series, we can get elements by label, with loc
:
first_row.loc['fert_rate']
1.860875
We can also get elements by position (offset), with iloc
:
first_row.iloc[0]
'United States'
You can index without loc and iloc, but please don’t, except …¶
In the data frame introduction, you saw that we can
index into data frame directly, using [ something ]
after the data frame
value, without using loc
or iloc
. Call this direct indexing. This is
a very reasonable thing to do if:
the
something
is a Boolean sequence. This is not ambiguous.the
something
is a column name. This isn’t ambiguous either.
We used both types of direct indexing in the data frame introduction page.
For example, both of these are entirely reasonable uses of direct indexing:
# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
is_big_spender = richest_15['health_exp_per_cap'] > 3000 # Boolean Series
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
# Direct indexing with column label:
# Select the 'country_code' column.
country_code = richest_15['country_code']
country_code
202 USA
35 CHN
97 JPN
49 DEU
67 GBR
63 FRA
26 BRA
88 IND
94 ITA
164 RUS
32 CAN
10 AUS
104 KOR
58 ESP
124 MEX
Name: country_code, dtype: object
Those uses of direct indexing are entirely sensible. Other uses of direct indexing are not sensible, and we strongly encourage you not to use them.
For example, Pandas will also allow you to this, with the Series we just made:
# Nooooooo !
country_code[10]
'AUS'
Please don’t do that. This is ambiguous. Did you really mean the country with label 10? Or did you mean the country at position 10? Pandas will guess for you. Don’t let it. If you are not selecting rows / elements with Booleans, always do this:
# By label
country_code.loc[10]
'AUS'
or this:
# By position
country_code.iloc[10]
'CAN'
You can use Boolean indexing with loc¶
It is often convenient to use direct indexing with Booleans, as we have just seen:
# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_15['health_exp_per_cap'] > 3000
is_big_spender
202 True
35 False
97 True
49 True
67 True
63 True
26 False
88 False
94 True
164 False
32 True
10 True
104 False
58 False
124 False
Name: health_exp_per_cap, dtype: bool
# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
We can also use Booleans with loc
indexing. This allows us to do things such
as selecting rows with Booleans, and columns with labels, in one shot, like
this:
# Boolean indexing for rows, with "loc"
# Get "country_name" column values for big health spenders:
big_spender_names = richest_15.loc[is_big_spender, 'country_name']
big_spender_names
202 United States
97 Japan
49 Germany
67 United Kingdom
63 France
94 Italy
32 Canada
10 Australia
Name: country_name, dtype: object
In fact you can use some Boolean sequences with iloc
as well, but that needs
a little more explanation.