6.8 Safe Pandas

Download notebook Interact

A lot of Pandas’ design is for speed and efficiency.

Unfortunately, this sometimes means that is it easy to use Pandas incorrectly, and so get results that you do not expect.

This page has some rules we suggest you follow to stay out of trouble when using Pandas.

As your understanding increases, you may find that you can relax some of these rules, but the problems in this page can trip up experts, so please, be very careful, and only relax these rules when you are very confident you understand the underlying problems. See Gory Pandas for a short walk through some of the complexities.

Copies and views

Consider this data frame, which should be familiar. It is a table where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.

import pandas as pd
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 3.756147 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803

Now imagine that we have discovered that the rating for ‘Clarity’ in the first row is incorrect; it should be 4.0.

We get ready to make a new, fixed copy of the data frame, to store the modified values. We put the ‘Disciplines’ column into the data frame to start with.

fixed_ratings = pd.DataFrame()
fixed_ratings['Discipline'] = ratings['Discipline']

Our next obvious step is to get the ‘Clarity’ column as a Pandas Series, for us to work on.

clarity = ratings['Clarity']
clarity.head()
0    3.756147
1    3.487379
2    3.608331
3    3.909520
4    3.788818
Name: Clarity, dtype: float64

We set the corrected first value:

clarity.iloc[0] = 4
clarity.head()
/Users/mb312/Library/Python/3.7/lib/python/site-packages/pandas/core/indexing.py:189: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

0    4.000000
1    3.487379
2    3.608331
3    3.909520
4    3.788818
Name: Clarity, dtype: float64

Notice the warning. We will come back to that soon.

Notice too that we have changed the value in the clarity Series.

Consider — what happens to the matching value in the original data frame?

To answer that question, we need to know what kind of thing our clarity Series was.

clarity could be a view onto the ‘Clarity’ column in the original data frame ratings. A view is something that points to the same memory. When we have a view, the view is another way of looking at the same data. If we modify the data in the view, that means we also modify the original data frame, because the data is the same.

clarity could also be copy of the ‘Clarity’ column. A copy duplicates the values from the original data. Therefore a copy has its own values, and its own memory. Changing the data in the copy will have no effect on the original data frame, because the data is different.

ratings.head()
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 4.000000 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803

We have found that the clarity Series was a view, because the change we made to clarity also changed the value in the original data frame.

This may not be what you expected, and it is probably not what you meant to do.

This leads to the first rule for safe handling of Pandas.

Rule 1: copy right.

We strongly suggest that when you get stuff out of a Pandas data frame or Series by indexing, you always make what you have into a copy.

We call this rule copy right.

As a reminder indexing is where we fetch data from something using square brackets. Indexing can be: direct, with the square brackets directly following the data frame or Series; or indirect, where the square brackets follow the .loc or .iloc attributes of the data frame or Series.

For example, we have just used direct indexing (square brackets) to fetch the ‘Clarity’ data out of the ratings data frame.

# Indexing to fetch a Series from a data frame.
clarity = ratings['Clarity']

We found that clarity is a view onto the ‘Clarity’ data in ratings. This is rarely what we want.

Here we apply the copy right rule:

# Applying the "copy right" rule.
clearer_clarity = ratings['Clarity'].copy()

Notice we apply the .copy() method to the ‘Clarity’ Series, so forcing Pandas to make us a copy of the data.

Now we have done that, we can modify the result without affecting the original data frame.

# Modify the copy with some crazy value.
clearer_clarity.iloc[0] = 99
clearer_clarity.head()
0    99.000000
1     3.487379
2     3.608331
3     3.909520
4     3.788818
Name: Clarity, dtype: float64

This does not affect the original data frame:

ratings.head()
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 4.000000 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803

Copies, views, confusing, warnings

It can be very difficult to predict when Pandas indexing will give a copy or a view.

For example, here we use indirect indexing (square brackets following .iloc) to select the row of ratings with index label 0. Remember .loc indexing uses the index labels, not positions, although in this case the index has label 0 for position 0.

row_0 = ratings.loc[0]
row_0
Discipline              English
Number of Professors      23343
Clarity                       4
Helpfulness             3.82187
Overall Quality         3.79136
Easiness                3.16275
Name: 0, dtype: object

We saw earlier that direct indexing to select a column ‘Clarity’ gave us a view, meaning that we could change the values in the data frame by changing the Series clarity we got from indexing. In fact this is also true if we use indirect indexing with .loc or .iloc. Check this by trying clarity = ratings.loc[:, 'Clarity'] in the code above.

We have just fetched the row labeled 0 using .loc. Given what we know about fetching a column, it would be reasonable to predict this would give us a view.

Does it give a view? Or a copy?

# Changing the 'Clarity' value of the first row.
row_0.loc['Clarity'] = 5
row_0
/Users/mb312/Library/Python/3.7/lib/python/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Discipline              English
Number of Professors      23343
Clarity                       5
Helpfulness             3.82187
Overall Quality         3.79136
Easiness                3.16275
Name: 0, dtype: object

We will get to the warning soon.

Did we change the original data frame?

ratings.head()
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 4.000000 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803

No, we didn’t change the original data frame — and we conclude that row_0 is a copy.

Our first, correct, response is to follow the copy right rule, and make this copy explicit, so we know exactly what we have:

# The "copy right" rule again.
copied_row_0 = ratings.iloc[0].copy()

We no longer have a nasty warning when we modify copied_row_0, because Pandas knows we made a copy, so it does not need to warn us that we may be making a mistake:

# We don't get a warning when we change the copied result.
copied_row_0.loc['Clarity'] = 5
copied_row_0
Discipline              English
Number of Professors      23343
Clarity                       5
Helpfulness             3.82187
Overall Quality         3.79136
Easiness                3.16275
Name: 0, dtype: object

Please worry about these warnings. In fact, in the interests of safety, we come to rule 2.

Rule 2: make errors for copy/view warnings

Pandas has a setting that allows you to change the nasty warning about setting with copies into an error.

We strongly suggest that you do that, for all your notebooks, like this:

pd.set_option('mode.chained_assignment','raise')

After you have done this, Pandas will stop if you try to do something like this:

row_0 = ratings.loc[0]   # Copy?  Or wiew?  Difficult to guess.
# Now this generates an error.
row_0.loc['Clarity'] = 99
SettingWithCopyError                      Traceback (most recent call last)
   ...
SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

At first you will find this advice annoying. Your code will generate confusing errors, and you will be tempted to remove this error option to make the errors go away. Please be patient. You will find that, if you follow the copy right rule carefully, most of these errors go away.

Copy, views, on the left

Now consider this code:

ratings.loc[0].loc['Clarity'] = 99
SettingWithCopyError                      Traceback (most recent call last)
   ...
SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Because we have set the mode.chained_assignment option to error above, this generates an error — but why?

The reason is the same as the reason for the previous error. The code in the cell directly above is just a short-cut for this exact equivalent.

row_0 = ratings.loc[0]
row_0.loc['Clarity'] = 99
SettingWithCopyError                      Traceback (most recent call last)
   ...
SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Specifically, when Python sees ratings.loc[0].loc['Clarity'] = 99, it first evaluates ratings.loc[0] to generate a temporary copy. Let’s call this temporary copy tmp. It then tries to set the value into the copy with tmp.loc['Clarity'] = 99. This generates the same error as you saw before.

As you have probably guessed from the option name above, Pandas calls this chained assignment, because you are: first, fetching the stuff you want do the assignment on (ratings.loc[0]) and then doing the assignment .loc['Clarity'] = 99. There are two steps on the left hand side, in a chain, first fetching the data, then assigning.

The problem that Pandas has is that it cannot tell that this has happened, so it can’t tell what you mean. Python will ask Pandas to generate ratings.loc[0] first, which it does, to generate the temporary copy that we can call tmp. Python then ask Pandas to set the value with tmp.loc['Clarity'] = 99. When Pandas gets this second instruction, it has no way of knowing that tmp came from the combined instruction ratings.loc[0].loc['Clarity'] = 99, and so all it can do is set the value into the copy, as instructed.

This leads us to the last rule.

Rule 3: loc left

When you do want to use indexing on the left hand side, to set some values into a data frame or Series, try do to this all in one shot, using indirect indexing with .loc or iloc.

For example, you have just see that this generates an error, and why:

ratings.loc[0].loc['Clarity'] = 99
SettingWithCopyError                      Traceback (most recent call last)
   ...
SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

You can avoid that error by doing all your left-hand-side indexing in one shot, like this:

ratings.loc[0, 'Clarity'] = 99

Notice there is no error. This is because, in this second case, Pandas gets all the instructions in one go. It can see from this combined instruction that we meant to set the ‘Clarity’ value for the row labeled 0 in the ratings data frame, and does just this.

Keep calm, follow the three rules

Do not worry if some of this is not immediately clear; it is not easy.

The trick is to remember the three rules:

  • Copy right.
  • Make copy warnings into errors.
  • Use .loc and .iloc for your left-hand-side indexing.