The power of groupby¶

import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

This page follows from the Crosstab page.

As for the Crosstab page, we will use a dataset with data about passengers on the RMS Titanic.

See the Titanic dataset page for some more information about this dataset, and a link to download the data.

# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()

	name	gender	age	class	embarked	country	fare	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	Southampton	United States	7.11	no
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	Southampton	United States	20.05	no
2	Abbott, Mr. Rossmore Edward	male	16.0	3rd	Southampton	United States	20.05	no
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.05	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.13	yes

In the Crosstab page, we calculated counts and proportions of rows falling into categories given by Pandas Series. These Series are often columns from a Pandas data frame.

Pandas also has a very powerful mechanism for grouping observations by category, available by the Pandas data frame groupby method.

The transformations available with groupby are far more general than those in pd.crosstab. Among many other things, groupby allows us to calculate many different summaries of rows in different categories.

For much more detail on groupby, see the Python Data Science Handbook chapter, by Jake Vanderplas.

Just for example, imagine we were interested in the average age of the passengers in each class.

Groups and GroupBy¶

We could start by making a Pandas groupby object, like this:

by_class = df.groupby('class')
by_class

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10d70f730>

The GroupBy object has categorized the rows according to the values in the class column. Therefore, the GroupBy object has three groups, one for 1st, one of 2nd and one for 3rd.

We can see how many groups we have with Python’s len function:

# The number of groups.
len(by_class)

If the fancy takes us, we can fetch the rows corresponding to the groups with the get_group method.

first_class_passengers = by_class.get_group('1st')
first_class_passengers.head()

	name	gender	age	class	embarked	country	fare	survived
20	Allen, Miss. Elisabeth Walton	female	29.000000	1st	Southampton	United States	211.6009	yes
22	Allison, Master. Hudson Trevor	male	0.916667	1st	Southampton	Canada	151.1600	yes
23	Allison, Miss. Helen Loraine	female	2.000000	1st	Southampton	Canada	151.1600	no
24	Allison, Mr. Hudson Joshua Creighton	male	30.000000	1st	Southampton	Canada	151.1600	no
25	Allison, Mrs. Bessie Waldo	female	25.000000	1st	Southampton	United States	151.1600	no

This is exactly the same result we would get by selecting the first class passenger rows directly:

first_again = df[df['class'] == '1st']
first_again.head()

	name	gender	age	class	embarked	country	fare	survived
20	Allen, Miss. Elisabeth Walton	female	29.000000	1st	Southampton	United States	211.6009	yes
22	Allison, Master. Hudson Trevor	male	0.916667	1st	Southampton	Canada	151.1600	yes
23	Allison, Miss. Helen Loraine	female	2.000000	1st	Southampton	Canada	151.1600	no
24	Allison, Mr. Hudson Joshua Creighton	male	30.000000	1st	Southampton	Canada	151.1600	no
25	Allison, Mrs. Bessie Waldo	female	25.000000	1st	Southampton	United States	151.1600	no

Aggregating¶

The true power of GroupBy is in doing efficient operations on the groups, and returning the results.

Just for example, imagine that we wanted to get the mean age and fare for each class. We could do this the long way round, by identifying each set of rows, and taking their mean:

print('1st mean', df[df['class'] == '1st'].mean())
print('2nd mean', df[df['class'] == '2nd'].mean())
print('3rd mean', df[df['class'] == '3rd'].mean())

1st mean age     39.441584
fare    84.256920
dtype: float64
2nd mean age     30.193916
fare    21.543087
dtype: float64
3rd mean age     24.978557
fare    12.649753
dtype: float64

Notice that the mean function only returned results for the numerical columns. It did not, for example, try to take a mean of the passenger name.

The code above is a repetitive and ugly - but it could have been much worse. Imagine, for example, that we had been interested in the mean age for passengers for each country. There are lots of countries; it would not be practical to type these all out by hand as we have above for the classes.

df['country'].value_counts()

England                  323
United States            257
Ireland                  119
Sweden                   104
Lebanon                   70
Finland                   54
Canada                    33
Norway                    26
France                    23
Belgium                   22
Bulgaria                  17
Switzerland               17
Scotland                  16
Croatia (Modern)          12
Croatia                   11
Channel Islands           10
Spain                      8
Wales                      8
Argentina                  7
India                      7
Hungary                    7
Denmark                    7
South Africa               6
Turkey                     6
Germany                    5
Bosnia                     4
Slovenia                   4
Italy                      4
Poland                     3
Uruguay                    2
Peru                       2
Austria                    2
Greece                     2
Siam                       2
Australia                  2
Russia                     2
Mexico                     1
Japan                      1
Syria                      1
Yugoslavia                 1
Latvia                     1
Cuba                       1
Slovakia (Modern day)      1
China/Hong Kong            1
Egypt                      1
Name: country, dtype: int64

To your complete lack of surprise, enter GroupBy, to our rescue.

We can ask the GroupBy object to aggregate over the rows in each group. In this case we want to use the mean function to give us an aggregate value for the rows in each group. Here is one way to do that:

class_means = by_class.aggregate(np.mean)
class_means

	age	fare
class
1st	39.441584	84.256920
2nd	30.193916	21.543087
3rd	24.978557	12.649753

Notice the benefits over doing this by hand, above. We have to do less typing. We don’t have to know the unique values of class beforehand. We get a Pandas dataframe back, that displays nicely, and that we can index to get the values we want:

# Label indexing of the groupby results table.
class_means.loc['1st', 'age']

39.441583610188374

In fact, through some particularly obscure and clever Python magic, we can use mean and other Pandas data frame methods as methods of the GroupBy object, like this:

# Using mean function directly.
by_class.mean()

	age	fare
class
1st	39.441584	84.256920
2nd	30.193916	21.543087
3rd	24.978557	12.649753

# Using median function directly.
by_class.median()

	age	fare
class
1st	39.0	59.08
2nd	29.0	15.01
3rd	24.0	8.01

# Using sum function directly.
by_class.sum()

	age	fare
class
1st	11871.916667	25361.3328
2nd	7941.000000	5665.8318
3rd	16211.083333	8209.6894

The function we are using can return more than one value, in which case, we get a new column for each value. For example, the describe method returns various aggregate values for each column. Here’s describe in action on the main data frame:

df.describe()

	age	fare
count	1213.000000	1213.000000
mean	29.698269	32.346953
std	14.036480	48.214773
min	0.166667	3.030500
25%	21.000000	7.180600
50%	28.000000	14.100000
75%	38.000000	31.050600
max	74.000000	512.060700

When we apply describe to the groups, we get a new column for each calculated value that describe returns.

by_class.describe()

	age								fare
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
class
1st	301.0	39.441584	13.881000	0.916667	29.0	39.0	49.0	71.0	301.0	84.256920	72.979218	5.0000	31.00	59.08	93.1000	512.0607
2nd	263.0	30.193916	13.875730	0.583333	22.0	29.0	38.0	72.0	263.0	21.543087	13.555193	9.1309	13.00	15.01	26.0000	73.1000
3rd	649.0	24.978557	11.616721	0.166667	19.0	24.0	31.0	74.0	649.0	12.649753	10.969757	3.0305	7.15	8.01	15.0411	69.1100

So far we have allowed GroupBy to operate on all the columns - but sometimes we only want it to operate on a subset of the columns in the original data frame, and therefore, the columns for the groups.

Like the original data frame, GroupBy knows about the columns. For example, we can get a new GroupBy object, that contains only the age column, like so:

by_class['age']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x110422bb0>

This can be useful when we want to restrict our attention to the age column, and we are not interested in other columns, such as fare:

by_class['age'].mean()

class
1st    39.441584
2nd    30.193916
3rd    24.978557
Name: age, dtype: float64

We can get even more specific by passing keyword argument to the aggregate method. This allows us to say what aggregate columns we want back, and how to make them. Here we ask for:

An aggregate column mean_age, the result of applying np.mean to the age column of the groups.
Another aggregate column median_fare, the result of applying np.media to the fare column of the groups.

by_class.agg(mean_age=('age', np.mean),
             median_fare=('fare', np.median))

	mean_age	median_fare
class
1st	39.441584	59.08
2nd	30.193916	15.01
3rd	24.978557	8.01

Grouping by more than one column¶

GroupBy can also group by more than one set of categories.

For example, imagine we wanted to break down the rows of the data frame by both gender and class. That is, we want to form groups that are all the unique combinations of gender and class. These will be:

female and 1st
female and 2nd
female and 3rd
male and 1st
male and 2nd
male and 3rd

We can do this by giving groupby a list of the columns we want to categorize by:

by_gender_class = df.groupby(['gender', 'class'])
by_gender_class

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x110456130>

len(by_gender_class)

As before, GroupBy knows about each individual subgroup. For example, here we fetch the rows corresponding to gender female and class 3rd:

female_third = by_gender_class.get_group(('female', '3rd'))
female_third.head()

	name	gender	age	class	embarked	country	fare	survived
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.0500	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.1300	yes
12	Ahlin, Mrs. Johanna Persdotter	female	40.0	3rd	Southampton	Sweden	9.0906	no
14	Aks, Mrs. Leah	female	18.0	3rd	Southampton	England	9.0700	yes
28	Andersen-Jensen, Miss. Carla Christine Nielsine	female	19.0	3rd	Southampton	Denmark	7.1701	yes

Note that the group specification above has to be a tuple (surrounded by ()) rather than a list (surrounded by []).

As before, this gives us the same result as if we had done the selection the long way round:

female_3rd = df[np.logical_and(df['gender'] == 'female', df['class'] == '3rd')]
female_3rd.head()

	name	gender	age	class	embarked	country	fare	survived
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.0500	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.1300	yes
12	Ahlin, Mrs. Johanna Persdotter	female	40.0	3rd	Southampton	Sweden	9.0906	no
14	Aks, Mrs. Leah	female	18.0	3rd	Southampton	England	9.0700	yes
28	Andersen-Jensen, Miss. Carla Christine Nielsine	female	19.0	3rd	Southampton	Denmark	7.1701	yes

Our aggregation functions operate on all the subgroups:

by_gender_class.mean()

		age	fare
gender	class
female	1st	37.015152	104.078811
	2nd	28.870915	23.021025
	3rd	23.593800	14.752567
male	1st	41.336785	68.774733
	2nd	31.032091	20.606753
	3rd	25.627074	11.664950

Again, please see the Python Data Science Handbook chapter for much more detail.

Coding for Data - 2020 edition

The power of groupby¶

Groups and GroupBy¶

Aggregating¶

Grouping by more than one column¶