The power of groupby¶
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
This page follows from the Crosstab page.
As for the Crosstab page, we will use a dataset with data about passengers on the RMS Titanic.
See the Titanic dataset page for some more information about this dataset, and a link to download the data.
# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | Southampton | United States | 7.11 | no |
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | Southampton | United States | 20.05 | no |
2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | Southampton | United States | 20.05 | no |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.05 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.13 | yes |
In the Crosstab page, we calculated counts and proportions of rows falling into categories given by Pandas Series. These Series are often columns from a Pandas data frame.
Pandas also has a very powerful mechanism for grouping observations by
category, available by the Pandas data frame groupby
method.
The transformations available with groupby
are far more general than those
in pd.crosstab
. Among many other things, groupby
allows us to calculate
many different summaries of rows in different categories.
For much more detail on groupby
, see the Python Data Science Handbook
chapter,
by Jake Vanderplas.
Just for example, imagine we were interested in the average age of the passengers in each class
.
Groups and GroupBy¶
We could start by making a Pandas groupby
object, like this:
by_class = df.groupby('class')
by_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10d70f730>
The GroupBy object has categorized the rows according to the values in the
class
column. Therefore, the GroupBy object has three groups, one for
1st
, one of 2nd
and one for 3rd
.
We can see how many groups we have with Python’s len
function:
# The number of groups.
len(by_class)
3
If the fancy takes us, we can fetch the rows corresponding to the groups with
the get_group
method.
first_class_passengers = by_class.get_group('1st')
first_class_passengers.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
20 | Allen, Miss. Elisabeth Walton | female | 29.000000 | 1st | Southampton | United States | 211.6009 | yes |
22 | Allison, Master. Hudson Trevor | male | 0.916667 | 1st | Southampton | Canada | 151.1600 | yes |
23 | Allison, Miss. Helen Loraine | female | 2.000000 | 1st | Southampton | Canada | 151.1600 | no |
24 | Allison, Mr. Hudson Joshua Creighton | male | 30.000000 | 1st | Southampton | Canada | 151.1600 | no |
25 | Allison, Mrs. Bessie Waldo | female | 25.000000 | 1st | Southampton | United States | 151.1600 | no |
This is exactly the same result we would get by selecting the first class passenger rows directly:
first_again = df[df['class'] == '1st']
first_again.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
20 | Allen, Miss. Elisabeth Walton | female | 29.000000 | 1st | Southampton | United States | 211.6009 | yes |
22 | Allison, Master. Hudson Trevor | male | 0.916667 | 1st | Southampton | Canada | 151.1600 | yes |
23 | Allison, Miss. Helen Loraine | female | 2.000000 | 1st | Southampton | Canada | 151.1600 | no |
24 | Allison, Mr. Hudson Joshua Creighton | male | 30.000000 | 1st | Southampton | Canada | 151.1600 | no |
25 | Allison, Mrs. Bessie Waldo | female | 25.000000 | 1st | Southampton | United States | 151.1600 | no |
Aggregating¶
The true power of GroupBy is in doing efficient operations on the groups, and returning the results.
Just for example, imagine that we wanted to get the mean age
and fare
for
each class. We could do this the long way round, by identifying each set of
rows, and taking their mean:
print('1st mean', df[df['class'] == '1st'].mean())
print('2nd mean', df[df['class'] == '2nd'].mean())
print('3rd mean', df[df['class'] == '3rd'].mean())
1st mean age 39.441584
fare 84.256920
dtype: float64
2nd mean age 30.193916
fare 21.543087
dtype: float64
3rd mean age 24.978557
fare 12.649753
dtype: float64
Notice that the mean
function only returned results for the numerical
columns. It did not, for example, try to take a mean of the passenger name
.
The code above is a repetitive and ugly - but it could have been much worse. Imagine, for example, that we had been interested in the mean age for passengers for each country. There are lots of countries; it would not be practical to type these all out by hand as we have above for the classes.
df['country'].value_counts()
England 323
United States 257
Ireland 119
Sweden 104
Lebanon 70
Finland 54
Canada 33
Norway 26
France 23
Belgium 22
Bulgaria 17
Switzerland 17
Scotland 16
Croatia (Modern) 12
Croatia 11
Channel Islands 10
Spain 8
Wales 8
Argentina 7
India 7
Hungary 7
Denmark 7
South Africa 6
Turkey 6
Germany 5
Bosnia 4
Slovenia 4
Italy 4
Poland 3
Uruguay 2
Peru 2
Austria 2
Greece 2
Siam 2
Australia 2
Russia 2
Mexico 1
Japan 1
Syria 1
Yugoslavia 1
Latvia 1
Cuba 1
Slovakia (Modern day) 1
China/Hong Kong 1
Egypt 1
Name: country, dtype: int64
To your complete lack of surprise, enter GroupBy, to our rescue.
We can ask the GroupBy object to aggregate over the rows in each group. In this case we want to use the mean
function to give us an aggregate value for the rows in each group. Here is one way to do that:
class_means = by_class.aggregate(np.mean)
class_means
age | fare | |
---|---|---|
class | ||
1st | 39.441584 | 84.256920 |
2nd | 30.193916 | 21.543087 |
3rd | 24.978557 | 12.649753 |
Notice the benefits over doing this by hand, above. We have to do less typing. We don’t have to know the unique values of class
beforehand. We get a Pandas dataframe back, that displays nicely, and that we can index to get the values we want:
# Label indexing of the groupby results table.
class_means.loc['1st', 'age']
39.441583610188374
In fact, through some particularly obscure and clever Python magic, we can use
mean
and other Pandas data frame methods as methods of the GroupBy object,
like this:
# Using mean function directly.
by_class.mean()
age | fare | |
---|---|---|
class | ||
1st | 39.441584 | 84.256920 |
2nd | 30.193916 | 21.543087 |
3rd | 24.978557 | 12.649753 |
# Using median function directly.
by_class.median()
age | fare | |
---|---|---|
class | ||
1st | 39.0 | 59.08 |
2nd | 29.0 | 15.01 |
3rd | 24.0 | 8.01 |
# Using sum function directly.
by_class.sum()
age | fare | |
---|---|---|
class | ||
1st | 11871.916667 | 25361.3328 |
2nd | 7941.000000 | 5665.8318 |
3rd | 16211.083333 | 8209.6894 |
The function we are using can return more than one value, in which case, we
get a new column for each value. For example, the describe
method returns
various aggregate values for each column. Here’s describe
in action on the
main data frame:
df.describe()
age | fare | |
---|---|---|
count | 1213.000000 | 1213.000000 |
mean | 29.698269 | 32.346953 |
std | 14.036480 | 48.214773 |
min | 0.166667 | 3.030500 |
25% | 21.000000 | 7.180600 |
50% | 28.000000 | 14.100000 |
75% | 38.000000 | 31.050600 |
max | 74.000000 | 512.060700 |
When we apply describe
to the groups, we get a new column for each calculated value that describe
returns.
by_class.describe()
age | fare | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
class | ||||||||||||||||
1st | 301.0 | 39.441584 | 13.881000 | 0.916667 | 29.0 | 39.0 | 49.0 | 71.0 | 301.0 | 84.256920 | 72.979218 | 5.0000 | 31.00 | 59.08 | 93.1000 | 512.0607 |
2nd | 263.0 | 30.193916 | 13.875730 | 0.583333 | 22.0 | 29.0 | 38.0 | 72.0 | 263.0 | 21.543087 | 13.555193 | 9.1309 | 13.00 | 15.01 | 26.0000 | 73.1000 |
3rd | 649.0 | 24.978557 | 11.616721 | 0.166667 | 19.0 | 24.0 | 31.0 | 74.0 | 649.0 | 12.649753 | 10.969757 | 3.0305 | 7.15 | 8.01 | 15.0411 | 69.1100 |
So far we have allowed GroupBy to operate on all the columns - but sometimes we only want it to operate on a subset of the columns in the original data frame, and therefore, the columns for the groups.
Like the original data frame, GroupBy knows about the columns. For example,
we can get a new GroupBy object, that contains only the age
column, like so:
by_class['age']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x110422bb0>
This can be useful when we want to restrict our attention to the age
column, and we are not interested in other columns, such as fare
:
by_class['age'].mean()
class
1st 39.441584
2nd 30.193916
3rd 24.978557
Name: age, dtype: float64
We can get even more specific by passing keyword argument to the aggregate
method. This allows us to say what aggregate columns we want back, and how to make them. Here we ask for:
An aggregate column
mean_age
, the result of applyingnp.mean
to theage
column of the groups.Another aggregate column
median_fare
, the result of applyingnp.media
to thefare
column of the groups.
by_class.agg(mean_age=('age', np.mean),
median_fare=('fare', np.median))
mean_age | median_fare | |
---|---|---|
class | ||
1st | 39.441584 | 59.08 |
2nd | 30.193916 | 15.01 |
3rd | 24.978557 | 8.01 |
Grouping by more than one column¶
GroupBy can also group by more than one set of categories.
For example, imagine we wanted to break down the rows of the data frame by both gender
and class
. That is, we want to form groups that are all the unique combinations of gender
and class. These will be:
female
and1st
female
and2nd
female
and3rd
male
and1st
male
and2nd
male
and3rd
We can do this by giving groupby
a list of the columns we want to categorize
by:
by_gender_class = df.groupby(['gender', 'class'])
by_gender_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x110456130>
len(by_gender_class)
6
As before, GroupBy knows about each individual subgroup. For example, here we fetch the rows corresponding to gender female
and class 3rd
:
female_third = by_gender_class.get_group(('female', '3rd'))
female_third.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.0500 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.1300 | yes |
12 | Ahlin, Mrs. Johanna Persdotter | female | 40.0 | 3rd | Southampton | Sweden | 9.0906 | no |
14 | Aks, Mrs. Leah | female | 18.0 | 3rd | Southampton | England | 9.0700 | yes |
28 | Andersen-Jensen, Miss. Carla Christine Nielsine | female | 19.0 | 3rd | Southampton | Denmark | 7.1701 | yes |
Note that the group specification above has to be a tuple (surrounded by
()
) rather than a list (surrounded by []
).
As before, this gives us the same result as if we had done the selection the long way round:
female_3rd = df[np.logical_and(df['gender'] == 'female', df['class'] == '3rd')]
female_3rd.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.0500 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.1300 | yes |
12 | Ahlin, Mrs. Johanna Persdotter | female | 40.0 | 3rd | Southampton | Sweden | 9.0906 | no |
14 | Aks, Mrs. Leah | female | 18.0 | 3rd | Southampton | England | 9.0700 | yes |
28 | Andersen-Jensen, Miss. Carla Christine Nielsine | female | 19.0 | 3rd | Southampton | Denmark | 7.1701 | yes |
Our aggregation functions operate on all the subgroups:
by_gender_class.mean()
age | fare | ||
---|---|---|---|
gender | class | ||
female | 1st | 37.015152 | 104.078811 |
2nd | 28.870915 | 23.021025 | |
3rd | 23.593800 | 14.752567 | |
male | 1st | 41.336785 | 68.774733 |
2nd | 31.032091 | 20.606753 | |
3rd | 25.627074 | 11.664950 |
Again, please see the Python Data Science Handbook chapter for much more detail.