Handling Pandas safely¶
A lot of Pandas’ design is for speed and efficiency.
Unfortunately, this sometimes means that it is easy to use Pandas incorrectly, and so get results that you do not expect.
This page has some rules we suggest you follow to stay out of trouble when using Pandas.
As your understanding increases, you may find that you can relax some of these rules. But the problems on this page can trip up experts, so please be very careful, and only relax the rules when you are confident you understand the underlying problems. See Gory Pandas for a short walk through some of the complexities.
Copies and views¶
Consider this data frame, which should be familiar. It is a table where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.
import pandas as pd
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
Now imagine that we have discovered that the rating for ‘Clarity’ in the first row is incorrect; it should be 4.0.
We get ready to make a new, fixed copy of the data frame, to store the modified values. We put the ‘Discipline’ column into the data frame to start with.
fixed_ratings = pd.DataFrame()
fixed_ratings['Discipline'] = ratings['Discipline']
Our next obvious step is to get the ‘Clarity’ column as a Pandas Series, for us to work on.
clarity = ratings['Clarity']
clarity.head()
0 3.756147
1 3.487379
2 3.608331
3 3.909520
4 3.788818
Name: Clarity, dtype: float64
We set the corrected first value:
clarity.iloc[0] = 4
clarity.head()
/Users/mb312/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py:1637: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
0 4.000000
1 3.487379
2 3.608331
3 3.909520
4 3.788818
Name: Clarity, dtype: float64
Notice the warning. We will come back to that soon.
Notice too that we have changed the value in the clarity Series.
Consider: what happens to the matching value in the original data frame?
To answer that question, we need to know what kind of thing our clarity Series was.
clarity could be a view onto the ‘Clarity’ column in the original data frame ratings. A view is something that points to the same memory. When we have a view, the view is another way of looking at the same data. If we modify the data in the view, that means we also modify the original data frame, because the data is the same.
clarity could also be a copy of the ‘Clarity’ column. A copy duplicates the values from the original data. Therefore a copy has its own values, and its own memory. Changing the data in the copy will have no effect on the original data frame, because the data is different.
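To make the distinction concrete, here is a minimal sketch using plain NumPy arrays rather than Pandas. We use NumPy because its rules for views and copies are simple and predictable; the array values are made up for illustration, and this is not a statement about what Pandas will do in any particular case.

```python
import numpy as np

original = np.array([3.76, 3.49, 3.61])

# Basic slicing of a NumPy array gives a view: the same memory, seen
# through a different window.
view = original[:2]
# Calling .copy() gives a copy: new memory holding duplicate values.
copy = original[:2].copy()

view[0] = 4.0   # Also changes `original`, because the memory is shared.
copy[1] = 99.0  # Leaves `original` alone, because the copy has its own memory.

print(original)
print(np.shares_memory(original, view))  # True
print(np.shares_memory(original, copy))  # False
```

After running this, the first element of original has changed to 4.0, but the second element is still 3.49.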
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 4.000000 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
We have found that the clarity Series was a view, because the change we made to clarity also changed the value in the original data frame.
This may not be what you expected, and it is probably not what you meant to do.
This leads to the first rule for safe handling of Pandas.
Rule 1: copy right.¶
We strongly suggest that when you get stuff out of a Pandas data frame or Series by indexing, you always make what you have into a copy.
We call this rule copy right.
As a reminder, indexing is where we fetch data from something using square brackets. Indexing can be: direct, with the square brackets directly following the data frame or Series; or indirect, where the square brackets follow the .loc or .iloc attributes of the data frame or Series.
For example, we have just used direct indexing (square brackets) to fetch the ‘Clarity’ data out of the ratings data frame.
# Indexing to fetch a Series from a data frame.
clarity = ratings['Clarity']
We found that clarity is a view onto the ‘Clarity’ data in ratings. This is rarely what we want.
Here we apply the copy right rule:
# Applying the "copy right" rule.
clearer_clarity = ratings['Clarity'].copy()
Notice we apply the .copy() method to the ‘Clarity’ Series, so forcing Pandas to make us a copy of the data.
Now we have done that, we can modify the result without affecting the original data frame.
# Modify the copy with some crazy value.
clearer_clarity.iloc[0] = 99
clearer_clarity.head()
0 99.000000
1 3.487379
2 3.608331
3 3.909520
4 3.788818
Name: Clarity, dtype: float64
This does not affect the original data frame:
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 4.000000 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
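If you want to try the copy right rule without the ratings file, here is a small self-contained sketch. The data frame small and its values are made up for illustration; the only Pandas features it relies on are indexing and the .copy() method used above. It also shows the rule applied to a selection of rows, not just to a single column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the ratings data frame.
small = pd.DataFrame({
    'Discipline': ['English', 'Mathematics', 'Biology'],
    'Clarity': [3.76, 3.49, 3.61],
})

# Copy right for a column: copy as soon as we index.
safe_clarity = small['Clarity'].copy()
safe_clarity.iloc[0] = 99

# Copy right for a selection of rows: the same rule applies.
high = small[small['Clarity'] > 3.5].copy()
high['Clarity'] = high['Clarity'] + 1

# The original data frame still has its original values.
print(small)
```

Because both results are explicit copies, we can change them freely, and small keeps the values we started with.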
Copies, views, confusing, warnings¶
It can be very difficult to predict when Pandas indexing will give a copy or a view.
For example, here we use indirect indexing (square brackets following .loc) to select the row of ratings with index label 0. Remember .loc indexing uses the index labels, not positions, although in this case the index has label 0 for position 0.
row_0 = ratings.loc[0]
row_0
Discipline English
Number of Professors 23343
Clarity 4.0
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: 0, dtype: object
We saw earlier that direct indexing to select a column ‘Clarity’ gave us a view, meaning that we could change the values in the data frame by changing the Series clarity we got from indexing. In fact this is also true if we use indirect indexing with .loc or .iloc. Check this by trying clarity = ratings.loc[:, 'Clarity'] in the code above.
We have just fetched the row labeled 0 using .loc. Given what we know about fetching a column, it would be reasonable to predict this would give us a view. Does it give a view? Or a copy?
# Changing the 'Clarity' value of the first row.
row_0.loc['Clarity'] = 5
row_0
/Users/mb312/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py:1637: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
/Users/mb312/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py:692: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value, self.name)
Discipline English
Number of Professors 23343
Clarity 5
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: 0, dtype: object
Notice the warning, again.
But - this time - did we change the original data frame?
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 4.000000 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
No, we didn’t change the original data frame, and so we conclude that row_0 is a copy.
Our first, correct, response is to follow the copy right rule, and make this copy explicit, so we know exactly what we have:
# The "copy right" rule again.
copied_row_0 = ratings.iloc[0].copy()
We no longer have a nasty warning when we modify copied_row_0, because Pandas knows we made a copy, so it does not need to warn us that we may be making a mistake:
# We don't get a warning when we change the copied result.
copied_row_0.loc['Clarity'] = 5
copied_row_0
Discipline English
Number of Professors 23343
Clarity 5
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: 0, dtype: object
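One likely reason that a row comes back as a copy here, where a column did not, is that the row spans columns of different types (text and numbers). A Series has a single dtype, so Pandas has to build a new Series of dtype object to hold the mixed values; notice the dtype: object in the output above. The sketch below shows this with a made-up data frame called small; it is an illustration of the idea, not a rule for predicting copies and views in general.

```python
import pandas as pd

# Hypothetical miniature data frame with one text and one numeric column.
small = pd.DataFrame({
    'Discipline': ['English', 'Mathematics'],
    'Clarity': [3.76, 3.49],
})

# Each column has its own dtype.
print(small.dtypes)

# A row has to hold text and numbers together, so it has dtype object.
row = small.loc[0].copy()  # Copy right, so we know what we have.
print(row.dtype)

# Changing the copied row leaves the data frame alone.
row.loc['Clarity'] = 5
print(small.loc[0, 'Clarity'])
```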
Please do worry about these warnings. In fact, in the interests of safety, we come to rule 2.
Rule 2: make errors for copy/view warnings¶
Pandas has a setting that allows you to change the nasty warning about setting with copies into an error.
We strongly suggest that you do that, for all your notebooks, like this:
pd.set_option('mode.chained_assignment', 'raise')
After you have done this, Pandas will stop if you try to do something like this:
row_0 = ratings.loc[0] # Copy? Or view? Difficult to guess.
# Now this generates an error.
row_0.loc['Clarity'] = 99
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
<ipython-input-17-cc9b666af2bd> in <module>
1 row_0 = ratings.loc[0] # Copy? Or view? Difficult to guess.
2 # Now this generates an error.
----> 3 row_0.loc['Clarity'] = 99
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
690
691 iloc = self if self.name == "iloc" else self.obj.iloc
--> 692 iloc._setitem_with_indexer(indexer, value, self.name)
693
694 def _validate_key(self, key, axis: int):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1635 self._setitem_with_indexer_split_path(indexer, value, name)
1636 else:
-> 1637 self._setitem_single_block(indexer, value, name)
1638
1639 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1855
1856 # check for chained assignment
-> 1857 self.obj._check_is_chained_assignment_possible()
1858
1859 # actually do the set
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_is_chained_assignment_possible(self)
3858 return True
3859 elif self._is_copy:
-> 3860 self._check_setitem_copy(stacklevel=4, t="referent")
3861 return False
3862
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_setitem_copy(self, stacklevel, t, force)
3934
3935 if value == "raise":
-> 3936 raise com.SettingWithCopyError(t)
3937 elif value == "warn":
3938 warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
At first you will find this advice annoying. Your code will generate confusing errors, and you will be tempted to remove this error option to make the errors go away. Please be patient. You will find that, if you follow the copy right rule carefully, most of these errors go away.
Copy, views, on the left¶
Now consider this code:
ratings.loc[0].loc['Clarity'] = 99
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
<ipython-input-18-ca8596f27095> in <module>
----> 1 ratings.loc[0].loc['Clarity'] = 99
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
690
691 iloc = self if self.name == "iloc" else self.obj.iloc
--> 692 iloc._setitem_with_indexer(indexer, value, self.name)
693
694 def _validate_key(self, key, axis: int):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1635 self._setitem_with_indexer_split_path(indexer, value, name)
1636 else:
-> 1637 self._setitem_single_block(indexer, value, name)
1638
1639 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1855
1856 # check for chained assignment
-> 1857 self.obj._check_is_chained_assignment_possible()
1858
1859 # actually do the set
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_is_chained_assignment_possible(self)
3858 return True
3859 elif self._is_copy:
-> 3860 self._check_setitem_copy(stacklevel=4, t="referent")
3861 return False
3862
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_setitem_copy(self, stacklevel, t, force)
3934
3935 if value == "raise":
-> 3936 raise com.SettingWithCopyError(t)
3937 elif value == "warn":
3938 warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Because we have set the mode.chained_assignment option to raise above, this generates an error. But why?
The reason is the same as the reason for the previous error. The code in the cell directly above is just a short-cut for this exact equivalent.
tmp = ratings.loc[0]
tmp.loc['Clarity'] = 99
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
<ipython-input-19-48a6437f4642> in <module>
1 tmp = ratings.loc[0]
----> 2 tmp.loc['Clarity'] = 99
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
690
691 iloc = self if self.name == "iloc" else self.obj.iloc
--> 692 iloc._setitem_with_indexer(indexer, value, self.name)
693
694 def _validate_key(self, key, axis: int):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1635 self._setitem_with_indexer_split_path(indexer, value, name)
1636 else:
-> 1637 self._setitem_single_block(indexer, value, name)
1638
1639 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1855
1856 # check for chained assignment
-> 1857 self.obj._check_is_chained_assignment_possible()
1858
1859 # actually do the set
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_is_chained_assignment_possible(self)
3858 return True
3859 elif self._is_copy:
-> 3860 self._check_setitem_copy(stacklevel=4, t="referent")
3861 return False
3862
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_setitem_copy(self, stacklevel, t, force)
3934
3935 if value == "raise":
-> 3936 raise com.SettingWithCopyError(t)
3937 elif value == "warn":
3938 warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Specifically, when Python sees ratings.loc[0].loc['Clarity'] = 99, it first evaluates ratings.loc[0] to generate a temporary copy. In the code above, we called this temporary copy tmp. It then tries to set the value into the copy with tmp.loc['Clarity'] = 99. This generates the same error as you saw before.
As you have probably guessed from the option name above, Pandas calls this chained assignment, because you are: first, fetching the stuff you want to do the assignment on (ratings.loc[0]) and then doing the assignment .loc['Clarity'] = 99. There are two steps on the left hand side, in a chain, first fetching the data, then assigning.
The problem that Pandas has is that it cannot tell that this chained assignment has happened, so it can’t tell what you mean. Python will ask Pandas to generate ratings.loc[0] first, which it does, to generate the temporary copy that we can call tmp. Python then asks Pandas to set the value with tmp.loc['Clarity'] = 99. When Pandas gets this second instruction, it has no way of knowing that tmp came from the combined instruction ratings.loc[0].loc['Clarity'] = 99, and so all it can do is set the value into the copy, as instructed.
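To see the two steps for yourself, here is a toy sketch in plain Python. The Recorder class is made up for this illustration and has nothing to do with how Pandas is written; all it does is record the calls that Python makes when we index with square brackets. For comparison, the last assignment does the indexing in a single step with a combined key.

```python
class Recorder:
    """Toy container that logs how Python indexes into it (not Pandas code)."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append(f'{self.name}.__getitem__({key!r})')
        # Hand back a new temporary object, much as ratings.loc[0] does.
        return Recorder('tmp', self.log)

    def __setitem__(self, key, value):
        self.log.append(f'{self.name}.__setitem__({key!r}, {value!r})')


log = []
frame = Recorder('frame', log)

# Chained: two steps on the left hand side.
frame[0]['Clarity'] = 99
# One shot: a single step with a combined key.
frame[0, 'Clarity'] = 99

for entry in log:
    print(entry)
# frame.__getitem__(0)
# tmp.__setitem__('Clarity', 99)
# frame.__setitem__((0, 'Clarity'), 99)
```

In the chained case, frame is only ever asked to fetch something; the assignment goes to the temporary object. In the one-shot case, frame itself receives the assignment, with both parts of the key.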
This leads us to the last rule.
Rule 3: loc left¶
When you do want to use indexing on the left hand side, to set some values into a data frame or Series, try to do this all in one shot, using indirect indexing with .loc or .iloc.
For example, you have just seen that this generates an error, and why:
ratings.loc[0].loc['Clarity'] = 99
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
<ipython-input-20-ca8596f27095> in <module>
----> 1 ratings.loc[0].loc['Clarity'] = 99
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
690
691 iloc = self if self.name == "iloc" else self.obj.iloc
--> 692 iloc._setitem_with_indexer(indexer, value, self.name)
693
694 def _validate_key(self, key, axis: int):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1635 self._setitem_with_indexer_split_path(indexer, value, name)
1636 else:
-> 1637 self._setitem_single_block(indexer, value, name)
1638
1639 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1855
1856 # check for chained assignment
-> 1857 self.obj._check_is_chained_assignment_possible()
1858
1859 # actually do the set
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_is_chained_assignment_possible(self)
3858 return True
3859 elif self._is_copy:
-> 3860 self._check_setitem_copy(stacklevel=4, t="referent")
3861 return False
3862
~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in _check_setitem_copy(self, stacklevel, t, force)
3934
3935 if value == "raise":
-> 3936 raise com.SettingWithCopyError(t)
3937 elif value == "warn":
3938 warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
You can avoid that error by doing all your left-hand-side indexing in one shot, like this:
ratings.loc[0, 'Clarity'] = 99
Notice there is no error. This is because, in this second case, Pandas gets all the instructions in one go. It can see from this combined instruction that we meant to set the ‘Clarity’ value for the row labeled 0 in the ratings data frame, and does just this.
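Here are a few more one-shot assignments, as a sketch on a made-up data frame called small (the values are for illustration only). Each assignment gives the rows and the column together inside a single pair of square brackets.

```python
import pandas as pd

# Hypothetical miniature stand-in for the ratings data frame.
small = pd.DataFrame({
    'Discipline': ['English', 'Mathematics', 'Biology'],
    'Clarity': [3.76, 3.49, 3.61],
})

# Row label and column label in one .loc call.
small.loc[0, 'Clarity'] = 4.0

# Boolean Series to select rows, plus column label, still one .loc call.
is_low = small['Clarity'] < 3.7
small.loc[is_low, 'Clarity'] = 3.7

# Position-based version with .iloc: row position 2, column position 1.
small.iloc[2, 1] = 3.65

print(small)
```

After these three assignments, the ‘Clarity’ column of small contains 4.0, 3.7 and 3.65.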
Keep calm, follow the three rules¶
Do not worry if some of this is not immediately clear; it is not easy.
The trick is to remember the three rules:
Copy right.
Make copy warnings into errors.
Use .loc and .iloc for your left-hand-side indexing.