Indexing in Pandas

Way back in the introduction to data frames, we covered indexing into data frames with Boolean vectors, such as Pandas Series or Numpy arrays, and indexing into data frames with column names.

We call this type of indexing simple indexing

This page covers more sophisticated indexing in Pandas - advanced indexing.

To understand advanced indexing, we first need to cover the index of data frames and Series. The index contains the row labels for a data frame or series.

We start with our usual imports:

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

The index

We return to our first data frame, containing statistics on development and gender from the World Bank.

This is the gender_stats.csv file. See the gender stats data page for some more detail.

Download that file to the same directory as this notebook, if you are running on your own computer.

# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')

# Show the first 10 rows
gender_data.head(10)
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
0 Aruba ABW 1.66325 NaN NaN NaN 48.721939 NaN 0.103744
1 Afghanistan AFG 4.95450 19.961015 161.138034 2.834598 40.109708 444.00 32.715838
2 Angola AGO 6.12300 111.936542 254.747970 2.447546 NaN 501.25 26.937545
3 Albania ALB 1.76925 12.327586 574.202694 2.836021 47.201082 29.25 2.888280
4 Andorra AND NaN 3.197538 4421.224933 7.260281 47.123345 NaN 0.079547
5 United Arab Emirates ARE 1.79300 375.027082 2202.407569 2.581168 48.789260 6.00 9.080299
6 Argentina ARG 2.32800 550.980968 1148.256142 2.782216 48.915810 53.75 42.976675
7 Armenia ARM 1.54550 10.885362 348.663884 1.916016 46.782180 27.25 2.904683
8 American Samoa ASM NaN 0.640500 NaN NaN NaN NaN 0.055422
9 Antigua and Barbuda ATG 2.08200 1.298213 1152.493656 3.676514 48.291463 NaN 0.098872

As you know, the names at the top, in bold typeface, are the names of the columns. We can see these names using the columns attribute of the data frame:

gender_data.columns
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

These can also be called the column labels.

We have not paid attention so far to the numbers at the left of the table display, also in bold typeface.

In this case, they are ascending integers, starting at 0, with value 0 for the first row, 1 for the second row, and so on.

These numbers are the index of the data frame. The index contains the row labels.

We can see the row labels / index with the index attribute of the data frame.

gender_data.index
RangeIndex(start=0, stop=216, step=1)

This is saying we have a simple index, that is like a np.arange, starting at 0, and continuing in steps of 1 up to, but not including 216.

So far, the index looks like row numbers, with 0 for the first row, 1 for the second row. We will soon see that the index is more than that.

For example, let us look at the last 15 rows. We can get these with the tail method of the data frame.

last_15 = gender_data.tail(15)
last_15
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
201 Uruguay URY 2.027000 54.345132 1721.507752 6.044403 48.295555 15.50 3.419977
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
203 Uzbekistan UZB 2.372750 61.340649 334.476754 3.118842 48.387434 37.00 30.784500
204 St. Vincent and the Grenadines VCT 1.986000 0.730107 775.803386 4.365757 48.536415 45.75 0.109421
205 Venezuela, RB VEN 2.378250 376.146268 896.815314 1.587088 48.400934 97.00 30.734524
206 British Virgin Islands VGB NaN NaN NaN NaN 47.581520 NaN 0.029585
207 Virgin Islands (U.S.) VIR 1.760000 3.812000 NaN NaN NaN NaN 0.104141
208 Vietnam VNM 1.959500 181.820736 368.374550 3.779501 48.021053 54.75 90.742400
209 Vanuatu VUT 3.364750 0.782876 125.568712 3.689874 47.301617 82.50 0.258896
210 Samoa WSM 4.118250 0.799887 366.353096 5.697059 48.350049 54.75 0.192225
211 Kosovo XKX 2.142500 6.804620 NaN NaN NaN NaN 1.813820
212 Yemen, Rep. YEM 4.225750 36.819337 207.949700 1.417836 44.470076 399.75 26.246608
213 South Africa ZAF 2.375250 345.209888 1123.142656 4.241441 48.516298 143.75 54.177209
214 Zambia ZMB 5.394250 24.280990 185.556359 2.687290 49.934484 233.75 15.633220
215 Zimbabwe ZWE 3.943000 15.495514 115.519881 2.695188 49.529875 398.00 15.420964

Here is the index of this new data frame.

last_15.index
RangeIndex(start=201, stop=216, step=1)

Notice that the first row of this new data frame no longer has the label 0 - it has label 201. These numbers are the labels from the original data frame, from which we selected the rows.

Notice too that the second row in this table is the USA, and it has the label \202.

Now imagine that I sort the original data frame by Gross Domestic Product, and select the top 15 rows. By doing this, I select the top 15 richest countries.

# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 15 rows.
richest_15 = df_by_gdp.head(15)
richest_15
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
35 China CHN 1.558750 10182.790479 657.748859 3.015530 46.297964 28.75 1364.446000
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
26 Brazil BRA 1.795250 2198.765606 1303.199104 3.773473 47.784577 49.50 204.159544
88 India IND 2.449250 2019.005411 241.572477 1.292666 49.497234 185.25 1293.742537
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
164 Russian Federation RUS 1.724500 1822.691700 1755.506635 3.731354 48.968070 25.25 143.793504
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
104 Korea, Rep. KOR 1.232000 1346.751162 2385.447251 3.915606 48.023388 12.00 50.727212
58 Spain ESP 1.307500 1299.724261 2963.832825 6.545739 48.722231 5.00 46.553128
124 Mexico MEX 2.257000 1188.802780 1081.208948 3.225839 48.906296 40.00 124.203450

Now notice that the index values have followed their respective rows. The row for the USA is now first, but it has kept its original index value (row label) - 202. The index value identifies the row.

Here is the index for this small table:

richest_15.index
Int64Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')

Now let us say that I wanted to get the eleventh row of this table.

I would be tempted to do something like richest_15[10] - but then - what would Pandas give me? Does 10 mean the row label for “Australia”? Or does it mean the eleventh row of the table, as it would for an array, and therefore the row for “Canada”? That is, will the indexing using the labels (index values)? Or will it use the positions (as for indexing in Numpy arrays)?

The answer is, it is ambiguous, and when things are ambiguous, Pandas has to guess. That’s a bad idea, because then we have to guess what Pandas has guessed, and it is starting to get really confusing.

For that reason, we careful Pandas users do not index rows with code like richest_15[10]. When we are not using Booleans to select rows, we always use one of the two indexing attributes that Pandas provides. Each does a different type of indexing.

  • The loc (label locate) attribute indexes using the labels (row index).

  • The iloc (position locate) attribute indexes using the positions.

Here they are in action:

# Get the row with label 10
richest_15.loc[10]
country_name            Australia
country_code                  AUS
fert_rate                  1.8615
gdp_us_billion        1422.994116
health_exp_per_cap    4256.058988
health_exp_pub           6.292381
prim_ed_girls           48.576707
mat_mort_ratio                6.0
population               23.44456
Name: 10, dtype: object
# Get the row at position / offset 10
richest_15.iloc[10]
country_name               Canada
country_code                  CAN
fert_rate                  1.6003
gdp_us_billion        1708.473627
health_exp_per_cap    4616.539134
health_exp_pub           7.546247
prim_ed_girls           48.808926
mat_mort_ratio               7.25
population              35.517119
Name: 32, dtype: object

In more details, what is happening in the first cell above is that richest_15.loc gives us a special value (AKA object) that:

  • knows about the richest_15 data frame;

  • will give us rows and columns, if we use [ something ] on it.

So, read richest_15.loc[10] as:

  • Get the richest_15 value (the data frame);

  • Get the loc attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at labels;

  • Use this object to fetch the row labeled 10, by using indexing ([10])

Likewise, read richest_15.iloc[10] as:

  • Get the richest_15 value (the data frame);

  • Get the iloc attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at positions (offsets);

  • Use this object to fetch the row at offset 10, by using indexing ([10]).

When you select rows and columns from a data frame, consider carefully what you want.

Do you want to select rows and columns by label (by index, in the rows). Then you want .loc.

Do you want to select rows and columns by position (offset). Then you want .iloc.

Loc and iloc indexing work on columns too

Consider this. We use .loc and the label for the row, followed by the label for the column.

richest_15.loc[10, 'gdp_us_billion']
1422.994116449394

This selects the row labeled 10 (the USA row) and returns the value from the gdp_us_billion column.

Conversely, iloc only understands positions, and so, integers. It does not understand or use labels:

# We can't specify columns by label, using iloc.
richest_15.iloc[10, 'gdp_us_billion']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    722             try:
--> 723                 self._validate_key(k, i)
    724             except ValueError as err:

~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1374         else:
-> 1375             raise ValueError(f"Can only index by location with a [{self._valid_types}]")
   1376 

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-12-3b068730429a> in <module>
      1 # We can't specify columns by label, using iloc.
----> 2 richest_15.iloc[10, 'gdp_us_billion']

~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    887                     # AttributeError for IntervalTree get_value
    888                     return self.obj._get_value(*key, takeable=self._takeable)
--> 889             return self._getitem_tuple(key)
    890         else:
    891             # we by definition only have the 0th axis

~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1448     def _getitem_tuple(self, tup: Tuple):
   1449 
-> 1450         self._has_valid_tuple(tup)
   1451         with suppress(IndexingError):
   1452             return self._getitem_lowerdim(tup)

~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    723                 self._validate_key(k, i)
    724             except ValueError as err:
--> 725                 raise ValueError(
    726                     "Location based indexing can only have "
    727                     f"[{self._valid_types}] types"

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Notice the somewhat informative error right at the bottom of the error message.

iloc can get columns by their position. For example, fertility_rate is the third column, and so we can get the fertility rate by asking iloc for the eleventh row, third column, like so:

richest_15.iloc[10, 2]
1.6002999999999998

This is the value from the row at offset 10 and the column at offset 2.

Series have an index too

Data frames have columns (and column labels), and rows (with row labels — the index).

Series also have element labels — they also have an index.

Consider this column, of the fertility rates of the richest countries:

# Get the fert_rate column from the data frame
rich_ferts = richest_15['fert_rate']
rich_ferts.head()
202    1.860875
35     1.558750
97     1.430000
49     1.450000
67     1.842500
Name: fert_rate, dtype: float64

As y’all know, this is a Series:

type(rich_ferts)
pandas.core.series.Series

You probably also spotted the numbers to the left of the values. Yes, the row labels have come across from the data frame into the Series. We can see that the first element has row label 202, and we remember that this is the row label (index value) for the USA.

As for the data frames, here is the index for the series:

rich_ferts.index
Int64Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')

Like data frames, we can index by label, with loc, and by position, with iloc:

# The value for row labeled 10 - Australia.
rich_ferts.loc[10]
1.8615
# The value for the row at position 10 - Canada
rich_ferts.iloc[10]
1.6002999999999998

Rows from data frames are Series

Here we select the row at position 0 (the first row):

first_row = richest_15.iloc[0]
first_row
country_name          United States
country_code                    USA
fert_rate                  1.860875
gdp_us_billion           17369.1246
health_exp_per_cap      9060.068657
health_exp_pub             8.121961
prim_ed_girls              48.75883
mat_mort_ratio                 14.0
population               318.558175
Name: 202, dtype: object

This is a Series:

type(first_row)
pandas.core.series.Series

Notice that the Series has 9 elements, one for each column in the data frame.

As you remember, we can see the number of rows and columns in a data frame with:

richest_15.shape
(15, 9)

Remember, the columns have names, which we can also call labels.

When we select a row out of a data frame, we have one element per column, and the labels for the columns now become the labels for the rows.

first_row.index
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

As for any other Series, we can get elements by label, with loc:

first_row.loc['fert_rate']
1.860875

We can also get elements by position (offset), with iloc:

first_row.iloc[0]
'United States'

You can index without loc and iloc, but please don’t, except …

In the data frame introduction, you saw that we can index into data frame directly, using [ something ] after the data frame value, without using loc or iloc. Call this direct indexing. This is a very reasonable thing to do if:

  • the something is a Boolean sequence. This is not ambiguous.

  • the something is a column name. This isn’t ambiguous either.

We used both types of direct indexing in the data frame introduction page.

For example, both of these are entirely reasonable uses of direct indexing:

# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
is_big_spender = richest_15['health_exp_per_cap'] > 3000  # Boolean Series
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
# Direct indexing with column label:
# Select the 'country_code' column.
country_code = richest_15['country_code']
country_code
202    USA
35     CHN
97     JPN
49     DEU
67     GBR
63     FRA
26     BRA
88     IND
94     ITA
164    RUS
32     CAN
10     AUS
104    KOR
58     ESP
124    MEX
Name: country_code, dtype: object

Those uses of direct indexing are entirely sensible. Other uses of direct indexing are not sensible, and we strongly encourage you not to use them.

For example, Pandas will also allow you to this, with the Series we just made:

# Nooooooo !
country_code[10]
'AUS'

Please don’t do that. This is ambiguous. Did you really mean the country with label 10? Or did you mean the country at position 10? Pandas will guess for you. Don’t let it. If you are not selecting rows / elements with Booleans, always do this:

# By label
country_code.loc[10]
'AUS'

or this:

# By position
country_code.iloc[10]
'CAN'

You can use Boolean indexing with loc

It is often convenient to use direct indexing with Booleans, as we have just seen:

# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_15['health_exp_per_cap'] > 3000
is_big_spender
202     True
35     False
97      True
49      True
67      True
63      True
26     False
88     False
94      True
164    False
32      True
10      True
104    False
58     False
124    False
Name: health_exp_per_cap, dtype: bool
# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560

We can also use Booleans with loc indexing. This allows us to do things such as selecting rows with Booleans, and columns with labels, in one shot, like this:

# Boolean indexing for rows, with "loc"
# Get "country_name" column values for big health spenders:
big_spender_names = richest_15.loc[is_big_spender, 'country_name']
big_spender_names
202     United States
97              Japan
49            Germany
67     United Kingdom
63             France
94              Italy
32             Canada
10          Australia
Name: country_name, dtype: object

In fact you can use some Boolean sequences with iloc as well, but that needs a little more explanation.