The power of groupby

import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

This page follows from the Crosstab page.

As for the Crosstab page, we will use a dataset with data about passengers on the RMS Titanic.

See the Titanic dataset page for some more information about this dataset, and a link to download the data.

# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
name gender age class embarked country fare survived
0 Abbing, Mr. Anthony male 42.0 3rd Southampton United States 7.11 no
1 Abbott, Mr. Eugene Joseph male 13.0 3rd Southampton United States 20.05 no
2 Abbott, Mr. Rossmore Edward male 16.0 3rd Southampton United States 20.05 no
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.05 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.13 yes

In the Crosstab page, we calculated counts and proportions of rows falling into categories given by Pandas Series. These Series are often columns from a Pandas data frame.

Pandas also has a very powerful mechanism for grouping observations by category, available by the Pandas data frame groupby method.

The transformations available with groupby are far more general than those in pd.crosstab. Among many other things, groupby allows us to calculate many different summaries of rows in different categories.

For much more detail on groupby, see the Python Data Science Handbook chapter, by Jake Vanderplas.

Just for example, imagine we were interested in the average age of the passengers in each class.

Groups and GroupBy

We could start by making a Pandas groupby object, like this:

by_class = df.groupby('class')
by_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10d70f730>

The GroupBy object has categorized the rows according to the values in the class column. Therefore, the GroupBy object has three groups, one for 1st, one of 2nd and one for 3rd.

We can see how many groups we have with Python’s len function:

# The number of groups.
len(by_class)
3

If the fancy takes us, we can fetch the rows corresponding to the groups with the get_group method.

first_class_passengers = by_class.get_group('1st')
first_class_passengers.head()
name gender age class embarked country fare survived
20 Allen, Miss. Elisabeth Walton female 29.000000 1st Southampton United States 211.6009 yes
22 Allison, Master. Hudson Trevor male 0.916667 1st Southampton Canada 151.1600 yes
23 Allison, Miss. Helen Loraine female 2.000000 1st Southampton Canada 151.1600 no
24 Allison, Mr. Hudson Joshua Creighton male 30.000000 1st Southampton Canada 151.1600 no
25 Allison, Mrs. Bessie Waldo female 25.000000 1st Southampton United States 151.1600 no

This is exactly the same result we would get by selecting the first class passenger rows directly:

first_again = df[df['class'] == '1st']
first_again.head()
name gender age class embarked country fare survived
20 Allen, Miss. Elisabeth Walton female 29.000000 1st Southampton United States 211.6009 yes
22 Allison, Master. Hudson Trevor male 0.916667 1st Southampton Canada 151.1600 yes
23 Allison, Miss. Helen Loraine female 2.000000 1st Southampton Canada 151.1600 no
24 Allison, Mr. Hudson Joshua Creighton male 30.000000 1st Southampton Canada 151.1600 no
25 Allison, Mrs. Bessie Waldo female 25.000000 1st Southampton United States 151.1600 no

Aggregating

The true power of GroupBy is in doing efficient operations on the groups, and returning the results.

Just for example, imagine that we wanted to get the mean age and fare for each class. We could do this the long way round, by identifying each set of rows, and taking their mean:

print('1st mean', df[df['class'] == '1st'].mean())
print('2nd mean', df[df['class'] == '2nd'].mean())
print('3rd mean', df[df['class'] == '3rd'].mean())
1st mean age     39.441584
fare    84.256920
dtype: float64
2nd mean age     30.193916
fare    21.543087
dtype: float64
3rd mean age     24.978557
fare    12.649753
dtype: float64

Notice that the mean function only returned results for the numerical columns. It did not, for example, try to take a mean of the passenger name.

The code above is a repetitive and ugly - but it could have been much worse. Imagine, for example, that we had been interested in the mean age for passengers for each country. There are lots of countries; it would not be practical to type these all out by hand as we have above for the classes.

df['country'].value_counts()
England                  323
United States            257
Ireland                  119
Sweden                   104
Lebanon                   70
Finland                   54
Canada                    33
Norway                    26
France                    23
Belgium                   22
Bulgaria                  17
Switzerland               17
Scotland                  16
Croatia (Modern)          12
Croatia                   11
Channel Islands           10
Spain                      8
Wales                      8
Argentina                  7
India                      7
Hungary                    7
Denmark                    7
South Africa               6
Turkey                     6
Germany                    5
Bosnia                     4
Slovenia                   4
Italy                      4
Poland                     3
Uruguay                    2
Peru                       2
Austria                    2
Greece                     2
Siam                       2
Australia                  2
Russia                     2
Mexico                     1
Japan                      1
Syria                      1
Yugoslavia                 1
Latvia                     1
Cuba                       1
Slovakia (Modern day)      1
China/Hong Kong            1
Egypt                      1
Name: country, dtype: int64

To your complete lack of surprise, enter GroupBy, to our rescue.

We can ask the GroupBy object to aggregate over the rows in each group. In this case we want to use the mean function to give us an aggregate value for the rows in each group. Here is one way to do that:

class_means = by_class.aggregate(np.mean)
class_means
age fare
class
1st 39.441584 84.256920
2nd 30.193916 21.543087
3rd 24.978557 12.649753

Notice the benefits over doing this by hand, above. We have to do less typing. We don’t have to know the unique values of class beforehand. We get a Pandas dataframe back, that displays nicely, and that we can index to get the values we want:

# Label indexing of the groupby results table.
class_means.loc['1st', 'age']
39.441583610188374

In fact, through some particularly obscure and clever Python magic, we can use mean and other Pandas data frame methods as methods of the GroupBy object, like this:

# Using mean function directly.
by_class.mean()
age fare
class
1st 39.441584 84.256920
2nd 30.193916 21.543087
3rd 24.978557 12.649753
# Using median function directly.
by_class.median()
age fare
class
1st 39.0 59.08
2nd 29.0 15.01
3rd 24.0 8.01
# Using sum function directly.
by_class.sum()
age fare
class
1st 11871.916667 25361.3328
2nd 7941.000000 5665.8318
3rd 16211.083333 8209.6894

The function we are using can return more than one value, in which case, we get a new column for each value. For example, the describe method returns various aggregate values for each column. Here’s describe in action on the main data frame:

df.describe()
age fare
count 1213.000000 1213.000000
mean 29.698269 32.346953
std 14.036480 48.214773
min 0.166667 3.030500
25% 21.000000 7.180600
50% 28.000000 14.100000
75% 38.000000 31.050600
max 74.000000 512.060700

When we apply describe to the groups, we get a new column for each calculated value that describe returns.

by_class.describe()
age fare
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
class
1st 301.0 39.441584 13.881000 0.916667 29.0 39.0 49.0 71.0 301.0 84.256920 72.979218 5.0000 31.00 59.08 93.1000 512.0607
2nd 263.0 30.193916 13.875730 0.583333 22.0 29.0 38.0 72.0 263.0 21.543087 13.555193 9.1309 13.00 15.01 26.0000 73.1000
3rd 649.0 24.978557 11.616721 0.166667 19.0 24.0 31.0 74.0 649.0 12.649753 10.969757 3.0305 7.15 8.01 15.0411 69.1100

So far we have allowed GroupBy to operate on all the columns - but sometimes we only want it to operate on a subset of the columns in the original data frame, and therefore, the columns for the groups.

Like the original data frame, GroupBy knows about the columns. For example, we can get a new GroupBy object, that contains only the age column, like so:

by_class['age']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x110422bb0>

This can be useful when we want to restrict our attention to the age column, and we are not interested in other columns, such as fare:

by_class['age'].mean()
class
1st    39.441584
2nd    30.193916
3rd    24.978557
Name: age, dtype: float64

We can get even more specific by passing keyword argument to the aggregate method. This allows us to say what aggregate columns we want back, and how to make them. Here we ask for:

  • An aggregate column mean_age, the result of applying np.mean to the age column of the groups.

  • Another aggregate column median_fare, the result of applying np.media to the fare column of the groups.

by_class.agg(mean_age=('age', np.mean),
             median_fare=('fare', np.median))
mean_age median_fare
class
1st 39.441584 59.08
2nd 30.193916 15.01
3rd 24.978557 8.01

Grouping by more than one column

GroupBy can also group by more than one set of categories.

For example, imagine we wanted to break down the rows of the data frame by both gender and class. That is, we want to form groups that are all the unique combinations of gender and class. These will be:

  • female and 1st

  • female and 2nd

  • female and 3rd

  • male and 1st

  • male and 2nd

  • male and 3rd

We can do this by giving groupby a list of the columns we want to categorize by:

by_gender_class = df.groupby(['gender', 'class'])
by_gender_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x110456130>
len(by_gender_class)
6

As before, GroupBy knows about each individual subgroup. For example, here we fetch the rows corresponding to gender female and class 3rd:

female_third = by_gender_class.get_group(('female', '3rd'))
female_third.head()
name gender age class embarked country fare survived
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.0500 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.1300 yes
12 Ahlin, Mrs. Johanna Persdotter female 40.0 3rd Southampton Sweden 9.0906 no
14 Aks, Mrs. Leah female 18.0 3rd Southampton England 9.0700 yes
28 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 3rd Southampton Denmark 7.1701 yes

Note that the group specification above has to be a tuple (surrounded by ()) rather than a list (surrounded by []).

As before, this gives us the same result as if we had done the selection the long way round:

female_3rd = df[np.logical_and(df['gender'] == 'female', df['class'] == '3rd')]
female_3rd.head()
name gender age class embarked country fare survived
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.0500 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.1300 yes
12 Ahlin, Mrs. Johanna Persdotter female 40.0 3rd Southampton Sweden 9.0906 no
14 Aks, Mrs. Leah female 18.0 3rd Southampton England 9.0700 yes
28 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 3rd Southampton Denmark 7.1701 yes

Our aggregation functions operate on all the subgroups:

by_gender_class.mean()
age fare
gender class
female 1st 37.015152 104.078811
2nd 28.870915 23.021025
3rd 23.593800 14.752567
male 1st 41.336785 68.774733
2nd 31.032091 20.606753
3rd 25.627074 11.664950

Again, please see the Python Data Science Handbook chapter for much more detail.