Pandas indexing reprise

Download notebook Interact

Pandas indexing reprise

This page is a reminder on indexing for Pandas data frames and Series.

You have already seen the basics of Pandas indexing; this page is just a reminder of the later parts of the basic indexing page.

import pandas as pd

We use the familiar dataset on student ratings of professors. It is a table where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.

# Load the dataset as a data frame
ratings = pd.read_csv('rate_my_course.csv')
# Reorder by Easiness
ratings_by_easy = ratings.sort_values('Easiness', ascending=False)
# Make a smaller data frame with the first six rows
top_by_easy = ratings_by_easy.head(6)
# Show the smaller data frame.
top_by_easy

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
64	Reading	148	4.159392	4.188919	4.177500	3.882635
36	Physical Education	991	4.078698	4.030797	4.057719	3.832250
49	Speech	351	4.133191	4.101197	4.119345	3.674701
61	Child Development	171	3.950585	4.008070	3.979766	3.606082
31	Theater	1078	3.876633	3.821503	3.851837	3.584508
19	Music	2455	3.844509	3.787804	3.818114	3.542273

Here is an example Boolean Series that has True for rows where the “Clarity” rating was greater than 4.1, and False otherwise.

is_clear = top_by_easy['Clarity'] > 4.1
is_clear

   True
  False
   True
  False
  False
  False
Name: Clarity, dtype: bool

We will use that in the examples below.

Direct indexing

Direct indexing is where the indexing bracket [ goes right after the data frame. Examples are:

# Direct indexing with a column name.
top_by_easy['Discipline']

             Reading
  Physical Education
              Speech
   Child Development
             Theater
               Music
Name: Discipline, dtype: object

# Direct indexing with a Boolean sequence.
top_by_easy[is_clear]

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
64	Reading	148	4.159392	4.188919	4.177500	3.882635
49	Speech	351	4.133191	4.101197	4.119345	3.674701

As you have seen in the Pandas indexing page, the examples above are the two types of safe direct indexing into Pandas data frames:

Direct indexing with a column name.
Direct indexing with a Boolean sequence.

Indirect indexing by position with `iloc`

Indirect indexing is where we use the special .iloc and .loc attributes of data frames and Series. The data frame or series goes first, followed by .iloc or .loc, followed by the opening square bracket [, the specifiers for the values we want, and the closing square bracket ].

.iloc selects rows and columns by position. For example, here we ask for the first three rows:

top_by_easy.iloc[:3]

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
64	Reading	148	4.159392	4.188919	4.177500	3.882635
36	Physical Education	991	4.078698	4.030797	4.057719	3.832250
49	Speech	351	4.133191	4.101197	4.119345	3.674701

If we send .iloc two arguments, separated by commas, then the first argument refers to the rows, and the second to the columns. Here we ask for the first three rows and the first three columns:

top_by_easy.iloc[:3, :3]

	Discipline	Number of Professors	Clarity
64	Reading	148	4.159392
36	Physical Education	991	4.078698
49	Speech	351	4.133191

We can use : to select everything. For example, this selects all rows, and the last column:

clarity_with_iloc = top_by_easy.iloc[:, -1]
clarity_with_iloc

  3.882635
  3.832250
  3.674701
  3.606082
  3.584508
  3.542273
Name: Easiness, dtype: float64

Indirect indexing by label with `.loc`

We can also select items by their row and column labels. In this case, the row labels are also counting numbers (integers), so they are easily mistaken for positions if you are not careful.

row_labeled_64 = top_by_easy.loc[64]
row_labeled_64

Discipline              Reading
Number of Professors        148
Clarity                 4.15939
Helpfulness             4.18892
Overall Quality          4.1775
Easiness                3.88264
Name: 64, dtype: object

This is a different result than the one we get from iloc, which does look at position rather than label:

row_position_0 = top_by_easy.iloc[0]
row_position_0

Discipline              Reading
Number of Professors        148
Clarity                 4.15939
Helpfulness             4.18892
Overall Quality          4.1775
Easiness                3.88264
Name: 64, dtype: object

We can ask for multiple rows by label:

ratings_by_label = top_by_easy.loc[[64, 49, 31]]
ratings_by_label

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
64	Reading	148	4.159392	4.188919	4.177500	3.882635
49	Speech	351	4.133191	4.101197	4.119345	3.674701
31	Theater	1078	3.876633	3.821503	3.851837	3.584508

If we send .loc two arguments, separated by commas, then the first argument refers to the rows, and the second to the columns. The column labels are the column names. Here we ask for the rows labeled 64, 49, 31, and the column labeled “Discipline”:

ratings_by_row_col_label = top_by_easy.loc[[64, 49, 31], 'Discipline']
ratings_by_row_col_label

  Reading
   Speech
  Theater
Name: Discipline, dtype: object

If we want multiple columns we can pass a list of column names:

ratings_by_row_col_label = top_by_easy.loc[[64, 49, 31], ['Discipline', 'Clarity']]
ratings_by_row_col_label

	Discipline	Clarity
64	Reading	4.159392
49	Speech	4.133191
31	Theater	3.876633

This is a good way of selecting a subset of the columns from the data frame, using : to select all the rows:

some_columns = top_by_easy.loc[:, ['Discipline', 'Easiness']]
some_columns

	Discipline	Easiness
64	Reading	3.882635
36	Physical Education	3.832250
49	Speech	3.674701
61	Child Development	3.606082
31	Theater	3.584508
19	Music	3.542273

You can use Boolean sequences to select rows with .loc.

clear_clarity = top_by_easy.loc[is_clear, 'Clarity']
clear_clarity

64    4.159392
49    4.133191
Name: Clarity, dtype: float64

You can also use some Boolean sequences for .iloc, but it’s a bit more complicated. See Booleans and labels for more detail.

Pandas indexing reprise

Direct indexing

Indirect indexing by position with iloc

Indirect indexing by label with .loc

Indirect indexing by position with `iloc`

Indirect indexing by label with `.loc`