Pandas indexing reprise
Pandas indexing reprise
This page is a reminder on indexing for Pandas data frames and Series.
You have already seen the basics of Pandas indexing; this page is just a reminder of the later parts of the basic indexing page.
import pandas as pd
We use the familiar dataset on student ratings of professors. It is a table where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.
# Load the dataset as a data frame
ratings = pd.read_csv('rate_my_course.csv')
# Reorder by Easiness
ratings_by_easy = ratings.sort_values('Easiness', ascending=False)
# Make a smaller data frame with the first six rows
top_by_easy = ratings_by_easy.head(6)
# Show the smaller data frame.
top_by_easy
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
64 | Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
36 | Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
49 | Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
61 | Child Development | 171 | 3.950585 | 4.008070 | 3.979766 | 3.606082 |
31 | Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
19 | Music | 2455 | 3.844509 | 3.787804 | 3.818114 | 3.542273 |
Here is an example Boolean Series that has True for rows where the “Clarity” rating was greater than 4.1, and False otherwise.
is_clear = top_by_easy['Clarity'] > 4.1
is_clear
64 True
36 False
49 True
61 False
31 False
19 False
Name: Clarity, dtype: bool
We will use that in the examples below.
Direct indexing
Direct indexing is where the indexing bracket [
goes right after the data
frame. Examples are:
# Direct indexing with a column name.
top_by_easy['Discipline']
64 Reading
36 Physical Education
49 Speech
61 Child Development
31 Theater
19 Music
Name: Discipline, dtype: object
# Direct indexing with a Boolean sequence.
top_by_easy[is_clear]
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
64 | Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
49 | Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
As you have seen in the Pandas indexing page, the examples above are the two types of safe direct indexing into Pandas data frames:
- Direct indexing with a column name.
- Direct indexing with a Boolean sequence.
Indirect indexing by position with iloc
Indirect indexing is where we use the special .iloc
and .loc
attributes of data frames and Series. The data frame or series goes first, followed by .iloc
or .loc
, followed by the opening square bracket [
, the specifiers for the values we want, and the closing square bracket ]
.
.iloc
selects rows and columns by position. For example, here we ask for
the first three rows:
top_by_easy.iloc[:3]
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
64 | Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
36 | Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
49 | Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
If we send .iloc
two arguments, separated by commas, then the first argument
refers to the rows, and the second to the columns. Here we ask for the first three rows and the first three columns:
top_by_easy.iloc[:3, :3]
Discipline | Number of Professors | Clarity | |
---|---|---|---|
64 | Reading | 148 | 4.159392 |
36 | Physical Education | 991 | 4.078698 |
49 | Speech | 351 | 4.133191 |
We can use :
to select everything. For example, this selects all rows, and
the last column:
clarity_with_iloc = top_by_easy.iloc[:, -1]
clarity_with_iloc
64 3.882635
36 3.832250
49 3.674701
61 3.606082
31 3.584508
19 3.542273
Name: Easiness, dtype: float64
Indirect indexing by label with .loc
We can also select items by their row and column labels. In this case, the row labels are also counting numbers (integers), so they are easily mistaken for positions if you are not careful.
row_labeled_64 = top_by_easy.loc[64]
row_labeled_64
Discipline Reading
Number of Professors 148
Clarity 4.15939
Helpfulness 4.18892
Overall Quality 4.1775
Easiness 3.88264
Name: 64, dtype: object
This is a different result than the one we get from iloc
, which does look at position rather than label:
row_position_0 = top_by_easy.iloc[0]
row_position_0
Discipline Reading
Number of Professors 148
Clarity 4.15939
Helpfulness 4.18892
Overall Quality 4.1775
Easiness 3.88264
Name: 64, dtype: object
We can ask for multiple rows by label:
ratings_by_label = top_by_easy.loc[[64, 49, 31]]
ratings_by_label
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
64 | Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
49 | Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
31 | Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
If we send .loc
two arguments, separated by commas, then the first argument
refers to the rows, and the second to the columns. The column labels are the
column names. Here we ask for the rows labeled 64, 49, 31, and the column labeled “Discipline”:
ratings_by_row_col_label = top_by_easy.loc[[64, 49, 31], 'Discipline']
ratings_by_row_col_label
64 Reading
49 Speech
31 Theater
Name: Discipline, dtype: object
If we want multiple columns we can pass a list of column names:
ratings_by_row_col_label = top_by_easy.loc[[64, 49, 31], ['Discipline', 'Clarity']]
ratings_by_row_col_label
Discipline | Clarity | |
---|---|---|
64 | Reading | 4.159392 |
49 | Speech | 4.133191 |
31 | Theater | 3.876633 |
This is a good way of selecting a subset of the columns from the data frame,
using :
to select all the rows:
some_columns = top_by_easy.loc[:, ['Discipline', 'Easiness']]
some_columns
Discipline | Easiness | |
---|---|---|
64 | Reading | 3.882635 |
36 | Physical Education | 3.832250 |
49 | Speech | 3.674701 |
61 | Child Development | 3.606082 |
31 | Theater | 3.584508 |
19 | Music | 3.542273 |
You can use Boolean sequences to select rows with .loc
.
clear_clarity = top_by_easy.loc[is_clear, 'Clarity']
clear_clarity
64 4.159392
49 4.133191
Name: Clarity, dtype: float64
You can also use some Boolean sequences for .iloc
, but it’s a bit more
complicated. See Booleans and labels for more detail.