Gory Pandas
This notebook is about the difficult, painful, maybe even bloody subject of views and copies when using Pandas.
You don’t need to fully understand the results on this page to use Pandas effectively. This page is only to point out that you have to use the results of Pandas indexing with care. In particular, we suggest you follow the Pandas safe handling guide.
The official discussion of these topics is in the Pandas indexing documentation. If you really want to go deep into the Pandas bowels, you could even try this StackOverflow answer, but be warned, it’s dark down there.
import numpy as np
import pandas as pd
This is the course ratings dataset, where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
We make a smaller data frame to play with, using `.iloc`.
# First three rows, first three columns
first_33 = ratings.iloc[:3, :3]
first_33
| | Discipline | Number of Professors | Clarity |
---|---|---|---|
0 | English | 23343 | 3.756147 |
1 | Mathematics | 22394 | 3.487379 |
2 | Biology | 11774 | 3.608331 |
First consider the following. `discipline` is a Series that is a view of the values in `first_33`.
discipline = first_33['Discipline']
discipline
0 English
1 Mathematics
2 Biology
Name: Discipline, dtype: object
We can’t tell it’s a view yet, but this becomes clear when we change the values in `discipline`. First we change the first value in the Series, and we get a big warning. See the Pandas safe handling guide for more on this warning.
discipline.iloc[0] = 'Spanglish'
As expected, the value in `discipline` has changed:
discipline
0 Spanglish
1 Mathematics
2 Biology
Name: Discipline, dtype: object
Our question now is: what happened to the values in `first_33`, the data frame from which we fetched `discipline`? It turns out that `discipline` was a view. That means that the `discipline` values use the same memory as the `first_33` values, and so we see the changes in `first_33` as well:
first_33
| | Discipline | Number of Professors | Clarity |
---|---|---|---|
0 | Spanglish | 23343 | 3.756147 |
1 | Mathematics | 22394 | 3.487379 |
2 | Biology | 11774 | 3.608331 |
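One way to probe whether two objects use the same memory is NumPy's `np.shares_memory`. The sketch below is not part of the original notebook. It uses a small hypothetical all-float data frame called `demo`, filled with values from the ratings table. The answer for the column check can differ between Pandas versions and settings (for example, with Copy-on-Write enabled), so the code prints it rather than assuming it. The one result we can rely on is that an explicit `.copy()` has its own memory.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the values come from the ratings table.
demo = pd.DataFrame({'Clarity': [3.756147, 3.487379, 3.608331],
                     'Helpfulness': [3.821866, 3.641526, 3.701530]})

clarity = demo['Clarity']
clarity_copy = demo['Clarity'].copy()

# May be True or False, depending on the Pandas version and its settings.
print(np.shares_memory(clarity.to_numpy(), demo['Clarity'].to_numpy()))

# An explicit copy does not share memory with the data frame column.
copy_shares = np.shares_memory(clarity_copy.to_numpy(),
                               demo['Clarity'].to_numpy())
print(copy_shares)
```

Sharing memory at this low level is only a hint. Whether a later assignment shows up in the original data frame also depends on how Pandas handles the assignment.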
Unfortunately, it can be very difficult to work out whether you have a view or a copy. A copy has duplicates of the values, stored in different memory from the original data frame. In that case, modifying the copy modifies the duplicates, but not the original data frame.
For example, consider this. Is this a view or a copy?
first_row = first_33.iloc[0]
first_row
Discipline Spanglish
Number of Professors 23343
Clarity 3.75615
Name: 0, dtype: object
To test whether it is a view or a copy, we set the first value:
first_row.iloc[0] = 'Franglais'
first_row
Discipline Franglais
Number of Professors 23343
Clarity 3.75615
Name: 0, dtype: object
Now we look at the data frame from which `first_row` came. If it was a view, then the original data frame will have changed in the same way as `first_row`. If it was a copy, the original data frame will not change.
first_33
| | Discipline | Number of Professors | Clarity |
---|---|---|---|
0 | Spanglish | 23343 | 3.756147 |
1 | Mathematics | 22394 | 3.487379 |
2 | Biology | 11774 | 3.608331 |
`first_row` was a copy, because changing `first_row` did not change the original data frame.
Maybe you are thinking that you are getting the hang of this, but tarry awhile – there are many ways in which this can be confusing.
Look at this bit of code. Do you think that `first_33` will change?
# Will first_33 change?
first_33.iloc[0].iloc[0] = 'Franglais'
First guess whether `first_33` will change. Now have a look at whether the top left value has changed to ‘Franglais’.
first_33
| | Discipline | Number of Professors | Clarity |
---|---|---|---|
0 | Spanglish | 23343 | 3.756147 |
1 | Mathematics | 22394 | 3.487379 |
2 | Biology | 11774 | 3.608331 |
In fact the code above, `first_33.iloc[0].iloc[0] = 'Franglais'`, is exactly equivalent to the code we have already seen above:
first_row = first_33.iloc[0]
first_row.iloc[0] = 'Franglais'
Because it is exactly equivalent - it has the same result - it does not change the underlying data frame, even though it looks as if it should.
first_33
| | Discipline | Number of Professors | Clarity |
---|---|---|---|
0 | Spanglish | 23343 | 3.756147 |
1 | Mathematics | 22394 | 3.487379 |
2 | Biology | 11774 | 3.608331 |
The two versions are equivalent because, when we run:
first_33.iloc[0].iloc[0] = 'Franglais'
this first causes `first_33.iloc[0]` to make a copy, and after that, the `.iloc[0] = 'Franglais'` works on that copy. We don’t see the result, because we aren’t saving the copy anywhere, and it disappears into obscurity when we have run the code.
The Pandas indexing documentation refers to this as chained assignment, in the sense that we first do `first_33.iloc[0]` and then, in a subsequent (chained) assignment, we do `.iloc[0] = 'Franglais'` on the result.
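For contrast, here is a minimal sketch of the same change made in a single indexing step, so that there is no intermediate result for the assignment to get lost in. This is not code from the original notebook. `demo` is a hypothetical stand-in for `first_33`, built directly from the values shown above so that it is not itself a slice of another data frame, and 'Maths' is an arbitrary replacement value chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for first_33, built directly from the values above.
demo = pd.DataFrame({'Discipline': ['English', 'Mathematics', 'Biology'],
                     'Number of Professors': [23343, 22394, 11774],
                     'Clarity': [3.756147, 3.487379, 3.608331]})

# One indexing operation: row 0, column 0, set directly on the data frame.
demo.iloc[0, 0] = 'Franglais'

# The same idea using labels instead of positions.
demo.loc[1, 'Discipline'] = 'Maths'

print(demo)
```

Because the row and column are given together in one `.iloc` or `.loc` call, the assignment applies to `demo` itself, and the question of view or copy does not arise.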
The chaining is fairly obvious in the both-at-the-same-time version above, but it can be harder to spot when the assignments are separated, even by a line, as in:
first_row = first_33.iloc[0]
first_row.iloc[0] = 'Franglais'
This version over two lines is exactly equivalent, so is also chained assignment. It can be even more difficult to spot when the lines are a bit separated:
first_row = first_33.iloc[0]
# Do something
# Do something else
# And something else again.
# And then - the chained assignment!
first_row.iloc[0] = 'Franglais'
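If what you want is a separate row to work on, one option is to say so explicitly with `.copy()`, so that a later assignment cannot be mistaken for a change to the data frame. The following is a sketch under that assumption, not code from the original notebook. `demo` is again a hypothetical stand-in for `first_33`.

```python
import pandas as pd

# Hypothetical stand-in for first_33, built directly from the values above.
demo = pd.DataFrame({'Discipline': ['English', 'Mathematics', 'Biology'],
                     'Number of Professors': [23343, 22394, 11774],
                     'Clarity': [3.756147, 3.487379, 3.608331]})

# Ask for a copy explicitly, so there is no doubt about what we have.
first_row = demo.iloc[0].copy()

# Do something, do something else, and then modify the copy.
first_row.iloc[0] = 'Franglais'

print(first_row.iloc[0])  # The copy has the new value.
print(demo.iloc[0, 0])    # The data frame keeps its original value.
```

The explicit copy makes the intention visible at the line where the row is fetched, however far away the later assignment is.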
If you are already feeling confused, the confusion can get worse. Consider this slight variation on our original `first_33` data frame:
# First 3 rows, last three columns.
first_3_end_3 = ratings.iloc[:3, 3:]
first_3_end_3
| | Helpfulness | Overall Quality | Easiness |
---|---|---|---|
0 | 3.821866 | 3.791364 | 3.162754 |
1 | 3.641526 | 3.566867 | 3.063322 |
2 | 3.701530 | 3.657641 | 2.710459 |
Knowing what you know now - does the following give a view on the row, or a copy of the row?
first_row_f3e3 = first_3_end_3.iloc[0]
first_row_f3e3
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: 0, dtype: float64
We can check by setting a value on the row.
first_row_f3e3.iloc[0] = 99
first_row_f3e3
Helpfulness 99.000000
Overall Quality 3.791364
Easiness 3.162754
Name: 0, dtype: float64
Now you have guessed, have a look at the output of the cell below. If `first_row_f3e3` was a view, then the first value in the first column of the underlying data frame, `first_3_end_3`, will have changed to 99. If it was a copy, it will have the value it had before: 3.821866.
first_3_end_3
| | Helpfulness | Overall Quality | Easiness |
---|---|---|---|
0 | 99.000000 | 3.791364 | 3.162754 |
1 | 3.641526 | 3.566867 | 3.063322 |
2 | 3.701530 | 3.657641 | 2.710459 |
Was `first_row_f3e3` a view or a copy? Did you guess right?
The point of all this is to say that, when you take stuff out of a Pandas data frame with indexing, it can be very difficult to predict whether you have a view or a copy, and it can depend on what data you have in your data frame.
For example, above, we found that if all the data in the data frame are floats, then we get a view, but in our previous data frame, `first_33`, which has a mixture of column types, including strings and numbers, we got a copy.
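A plausible reason for the difference, offered here as an assumption rather than something this page establishes, is that a row drawn from columns of different types has to be repacked into a single Series with one general type, which needs new memory, whereas a row from an all-float data frame already has one type. The sketch below only shows the types involved. It does not test for views or copies, and `mixed` and `all_float` are hypothetical stand-ins for `first_33` and `first_3_end_3`.

```python
import pandas as pd

# Hypothetical stand-ins, built from the values shown in the tables above.
mixed = pd.DataFrame({'Discipline': ['English', 'Mathematics', 'Biology'],
                      'Number of Professors': [23343, 22394, 11774],
                      'Clarity': [3.756147, 3.487379, 3.608331]})
all_float = pd.DataFrame({'Helpfulness': [3.821866, 3.641526, 3.701530],
                          'Overall Quality': [3.791364, 3.566867, 3.657641],
                          'Easiness': [3.162754, 3.063322, 2.710459]})

# The column types of each data frame.
print(mixed.dtypes)
print(all_float.dtypes)

# The type of a single row taken from each data frame.
mixed_row_dtype = mixed.iloc[0].dtype
float_row_dtype = all_float.iloc[0].dtype
print(mixed_row_dtype)
print(float_row_dtype)
```

Checking `.dtypes` before indexing at least tells you which of the two situations you are in, although it does not guarantee what Pandas will return.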
The way out of this steaming set of tubes into hell is to use safe handling of Pandas.
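We do not reproduce the safe handling guide here. As a rough sketch of one widely used habit, which we assume rather than quote from that guide: make an explicit copy when you take a subset you intend to modify, and then set values with a single `.loc` or `.iloc` call. `ratings_demo` is a hypothetical stand-in for `ratings`, holding the first three columns of the rows shown by `ratings.head()`.

```python
import pandas as pd

# Hypothetical stand-in for the ratings data frame.
ratings_demo = pd.DataFrame({
    'Discipline': ['English', 'Mathematics', 'Biology',
                   'Psychology', 'History'],
    'Number of Professors': [23343, 22394, 11774, 11179, 11145],
    'Clarity': [3.756147, 3.487379, 3.608331, 3.909520, 3.788818]})

# Take the subset and copy it explicitly, so it is independent of the original.
first_33 = ratings_demo.iloc[:3, :3].copy()

# Set a value in one indexing step on the copy.
first_33.loc[0, 'Discipline'] = 'Spanglish'

print(first_33)
print(ratings_demo.head(3))  # The original is unchanged.
```

With the explicit copy, changes to `first_33` stay in `first_33`, and there is no need to guess whether the original data frame was affected.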