Pandas indexing reprise

Download notebook Interact

Pandas indexing reprise

This page is a reminder on indexing for Pandas data frames and Series.

You have already seen the basics of Pandas indexing; this page is just a reminder of the later parts of the basic indexing page.

import pandas as pd

We use the familiar dataset on student ratings of professors. It is a table where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.

# Load the dataset as a data frame
ratings = pd.read_csv('rate_my_course.csv')
# Reorder by Easiness
ratings_by_easy = ratings.sort_values('Easiness', ascending=False)
# Make a smaller data frame with the first six rows
top_by_easy = ratings_by_easy.head(6)
# Show the smaller data frame.
top_by_easy
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
64 Reading 148 4.159392 4.188919 4.177500 3.882635
36 Physical Education 991 4.078698 4.030797 4.057719 3.832250
49 Speech 351 4.133191 4.101197 4.119345 3.674701
61 Child Development 171 3.950585 4.008070 3.979766 3.606082
31 Theater 1078 3.876633 3.821503 3.851837 3.584508
19 Music 2455 3.844509 3.787804 3.818114 3.542273

Here is an example Boolean Series that has True for rows where the “Clarity” rating was greater than 4.1, and False otherwise.

is_clear = top_by_easy['Clarity'] > 4.1
is_clear
64     True
36    False
49     True
61    False
31    False
19    False
Name: Clarity, dtype: bool

We will use that in the examples below.

Direct indexing

Direct indexing is where the indexing bracket [ goes right after the data frame. Examples are:

# Direct indexing with a column name.
top_by_easy['Discipline']
64               Reading
36    Physical Education
49                Speech
61     Child Development
31               Theater
19                 Music
Name: Discipline, dtype: object
# Direct indexing with a Boolean sequence.
top_by_easy[is_clear]
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
64 Reading 148 4.159392 4.188919 4.177500 3.882635
49 Speech 351 4.133191 4.101197 4.119345 3.674701

As you have seen in the Pandas indexing page, the examples above are the two types of safe direct indexing into Pandas data frames:

  1. Direct indexing with a column name.
  2. Direct indexing with a Boolean sequence.

Indirect indexing by position with iloc

Indirect indexing is where we use the special .iloc and .loc attributes of data frames and Series. The data frame or series goes first, followed by .iloc or .loc, followed by the opening square bracket [, the specifiers for the values we want, and the closing square bracket ].

.iloc selects rows and columns by position. For example, here we ask for the first three rows:

top_by_easy.iloc[:3]
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
64 Reading 148 4.159392 4.188919 4.177500 3.882635
36 Physical Education 991 4.078698 4.030797 4.057719 3.832250
49 Speech 351 4.133191 4.101197 4.119345 3.674701

If we send .iloc two arguments, separated by commas, then the first argument refers to the rows, and the second to the columns. Here we ask for the first three rows and the first three columns:

top_by_easy.iloc[:3, :3]
Discipline Number of Professors Clarity
64 Reading 148 4.159392
36 Physical Education 991 4.078698
49 Speech 351 4.133191

We can use : to select everything. For example, this selects all rows, and the last column:

clarity_with_iloc = top_by_easy.iloc[:, -1]
clarity_with_iloc
64    3.882635
36    3.832250
49    3.674701
61    3.606082
31    3.584508
19    3.542273
Name: Easiness, dtype: float64

Indirect indexing by label with .loc

We can also select items by their row and column labels. In this case, the row labels are also counting numbers (integers), so they are easily mistaken for positions if you are not careful.

row_labeled_64 = top_by_easy.loc[64]
row_labeled_64
Discipline              Reading
Number of Professors        148
Clarity                 4.15939
Helpfulness             4.18892
Overall Quality          4.1775
Easiness                3.88264
Name: 64, dtype: object

This is a different result than the one we get from iloc, which does look at position rather than label:

row_position_0 = top_by_easy.iloc[0]
row_position_0
Discipline              Reading
Number of Professors        148
Clarity                 4.15939
Helpfulness             4.18892
Overall Quality          4.1775
Easiness                3.88264
Name: 64, dtype: object

We can ask for multiple rows by label:

ratings_by_label = top_by_easy.loc[[64, 49, 31]]
ratings_by_label
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
64 Reading 148 4.159392 4.188919 4.177500 3.882635
49 Speech 351 4.133191 4.101197 4.119345 3.674701
31 Theater 1078 3.876633 3.821503 3.851837 3.584508

If we send .loc two arguments, separated by commas, then the first argument refers to the rows, and the second to the columns. The column labels are the column names. Here we ask for the rows labeled 64, 49, 31, and the column labeled “Discipline”:

ratings_by_row_col_label = top_by_easy.loc[[64, 49, 31], 'Discipline']
ratings_by_row_col_label
64    Reading
49     Speech
31    Theater
Name: Discipline, dtype: object

If we want multiple columns we can pass a list of column names:

ratings_by_row_col_label = top_by_easy.loc[[64, 49, 31], ['Discipline', 'Clarity']]
ratings_by_row_col_label
Discipline Clarity
64 Reading 4.159392
49 Speech 4.133191
31 Theater 3.876633

This is a good way of selecting a subset of the columns from the data frame, using : to select all the rows:

some_columns = top_by_easy.loc[:, ['Discipline', 'Easiness']]
some_columns
Discipline Easiness
64 Reading 3.882635
36 Physical Education 3.832250
49 Speech 3.674701
61 Child Development 3.606082
31 Theater 3.584508
19 Music 3.542273

You can use Boolean sequences to select rows with .loc.

clear_clarity = top_by_easy.loc[is_clear, 'Clarity']
clear_clarity
64    4.159392
49    4.133191
Name: Clarity, dtype: float64

You can also use some Boolean sequences for .iloc, but it’s a bit more complicated. See Booleans and labels for more detail.