Cross-tabulation¶
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
It is often useful to make tables that count rows in categories defined by columns.
Here we will use a dataset with data about passengers on the RMS Titanic.
See the Titanic dataset page for some more information about this dataset, and a link to download the data.
# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | Southampton | United States | 7.11 | no |
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | Southampton | United States | 20.05 | no |
2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | Southampton | United States | 20.05 | no |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.05 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.13 | yes |
As you can see, we have various bits of information about the passengers on
the Titanic. Several columns give labels - categories. For example, the
gender
column categorizes the passenger as male
or female
, and the
survived
column categorizes the passenger by whether they survived the
sinking - yes
or no
.
A one-way table¶
As you know, we can use the value_counts
method of the data frame to see the
unique values for any column, with the counts of the number of times each
value occurs.
# Counts of each unique value in the 'survived' column.
df['survived'].value_counts()
no 743
yes 470
Name: survived, dtype: int64
Often we want to see proportions instead of counts. We can use the normalize
keyword argument for this:
# Proportions of each unique value in the 'survived' column.
df['survived'].value_counts(normalize=True)
no 0.612531
yes 0.387469
Name: survived, dtype: float64
Two-way tables¶
We often want to break down the rows by more than one category. Remember that
the Titanic passengers and crew tended to give preference to women and
children, when loading the lifeboats. So, we may want to see the counts of
passengers who survived, broken down by gender
.
This is the role of the Pandas crosstab
function. It is a Pandas function
because it is function inside the Pandas module; we can get this function with
pd.crosstab
(assuming we have done the usual import pandas as pd
).
The first argument to pd.crosstab
is the category we want to see in the rows; the second argument is the category we want to see in the columns.
Here is a cross-tabulation of gender
(in the rows) by survived
(in the columns):
# Cross-tabulation of counts for 'gender' (rows) by 'survived' (columns).
pd.crosstab(df['gender'], df['survived'])
survived | no | yes |
---|---|---|
gender | ||
female | 120 | 321 |
male | 623 | 149 |
We will often want to see these values as proportions rather than counts. For
example, we may be interested in the proportion of women and men that
survived. As for value_counts
above, we use the normalize
keyword to ask
for proportions. This time we have to specify the direction that Pandas
should use for the proportion. We could be interested in the proportion
across the column (proportions of male
and female
passengers within the
yes
“survived” category, likewise for the no
category). More likely, in
this case, we will be interested in proportions across the row (proportion who
survived within male
category, proportion who survived within female
category). We give Pandas this information with the value for the
normalize
keyword argument. Pandas uses the term index
to refer to the
rows. Remember, Pandas also uses the term index for the row labels.
# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the row.
pd.crosstab(df['gender'], df['survived'], normalize='index')
survived | no | yes |
---|---|---|
gender | ||
female | 0.272109 | 0.727891 |
male | 0.806995 | 0.193005 |
We could also ask for the proportions over the columns.
# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the column.
pd.crosstab(df['gender'], df['survived'], normalize='columns')
survived | no | yes |
---|---|---|
gender | ||
female | 0.161507 | 0.682979 |
male | 0.838493 | 0.317021 |
Of course, the arguments we are passing to pd.crosstab
are Pandas Series. We can pass any Pandas series as our row or column specification.
For example, we might want to make a new Series that categorizes passengers as
male
, female
or child
. We can use that Series as an argument to
pd.crosstab
.
mwc = df['gender'].copy() # Copy-right!
mwc[df['age'] <= 16] = 'child'
pd.crosstab(mwc, df['survived'], normalize='index')
survived | no | yes |
---|---|---|
gender | ||
child | 0.467153 | 0.532847 |
female | 0.257979 | 0.742021 |
male | 0.831429 | 0.168571 |
Three-way tables (or four …)¶
Another factor that we know was in play was passenger class. For various reasons, first class passengers, in particular, got better access to the lifeboats than passengers in other classes. This adds a further factor that might explain the numbers and proportions that survived.
Now we have two factors (categories) that we think may predict survival - class
and gender
.
pd.crosstab
takes this in its stride. We can just pass a list of Series as our row or column specification.
# Counts for passengers who survived, broken down by class and male, female,
# child.
categories = [df['class'], mwc]
pd.crosstab(categories, df['survived'])
survived | no | yes | |
---|---|---|---|
class | gender | ||
1st | child | 1 | 8 |
female | 4 | 124 | |
male | 109 | 55 | |
2nd | child | 1 | 25 |
female | 11 | 77 | |
male | 136 | 13 | |
3rd | child | 62 | 40 |
female | 82 | 78 | |
male | 337 | 50 |
normalize
over the rows will give proportions for each sub-category:
# Proportions surviving in sub-categories of class, male/female/child.
pd.crosstab(categories, df['survived'], normalize='index')
survived | no | yes | |
---|---|---|---|
class | gender | ||
1st | child | 0.111111 | 0.888889 |
female | 0.031250 | 0.968750 | |
male | 0.664634 | 0.335366 | |
2nd | child | 0.038462 | 0.961538 |
female | 0.125000 | 0.875000 | |
male | 0.912752 | 0.087248 | |
3rd | child | 0.607843 | 0.392157 |
female | 0.512500 | 0.487500 | |
male | 0.870801 | 0.129199 |
You can specify multiple Series for the columns of the table. Here we just run the equivalent table but reversing the rows and columns:
# Proportions surviving in sub-categories of class, male/female/child,
# with 'survived' in the rows, and the subcategories in the columns.
pd.crosstab(df['survived'], categories, normalize='columns')
class | 1st | 2nd | 3rd | ||||||
---|---|---|---|---|---|---|---|---|---|
gender | child | female | male | child | female | male | child | female | male |
survived | |||||||||
no | 0.111111 | 0.03125 | 0.664634 | 0.038462 | 0.125 | 0.912752 | 0.607843 | 0.5125 | 0.870801 |
yes | 0.888889 | 0.96875 | 0.335366 | 0.961538 | 0.875 | 0.087248 | 0.392157 | 0.4875 | 0.129199 |
You can specify more than two columns in the row or column arguments, or both. Try experimenting with different arguments.