Cross-tabulation

import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

It is often useful to make tables that count rows in categories defined by columns.

Here we will use a dataset with data about passengers on the RMS Titanic.

See the Titanic dataset page for some more information about this dataset, and a link to download the data.

# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
name gender age class embarked country fare survived
0 Abbing, Mr. Anthony male 42.0 3rd Southampton United States 7.11 no
1 Abbott, Mr. Eugene Joseph male 13.0 3rd Southampton United States 20.05 no
2 Abbott, Mr. Rossmore Edward male 16.0 3rd Southampton United States 20.05 no
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.05 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.13 yes

As you can see, we have various bits of information about the passengers on the Titanic. Several columns give labels - categories. For example, the gender column categorizes the passenger as male or female, and the survived column categorizes the passenger by whether they survived the sinking - yes or no.

A one-way table

As you know, we can use the value_counts method of the data frame to see the unique values for any column, with the counts of the number of times each value occurs.

# Counts of each unique value in the 'survived' column.
df['survived'].value_counts()
no     743
yes    470
Name: survived, dtype: int64

Often we want to see proportions instead of counts. We can use the normalize keyword argument for this:

# Proportions of each unique value in the 'survived' column.
df['survived'].value_counts(normalize=True)
no     0.612531
yes    0.387469
Name: survived, dtype: float64

Two-way tables

We often want to break down the rows by more than one category. Remember that the Titanic passengers and crew tended to give preference to women and children, when loading the lifeboats. So, we may want to see the counts of passengers who survived, broken down by gender.

This is the role of the Pandas crosstab function. It is a Pandas function because it is function inside the Pandas module; we can get this function with pd.crosstab (assuming we have done the usual import pandas as pd).

The first argument to pd.crosstab is the category we want to see in the rows; the second argument is the category we want to see in the columns.

Here is a cross-tabulation of gender (in the rows) by survived (in the columns):

# Cross-tabulation of counts for 'gender' (rows) by 'survived' (columns).
pd.crosstab(df['gender'], df['survived'])
survived no yes
gender
female 120 321
male 623 149

We will often want to see these values as proportions rather than counts. For example, we may be interested in the proportion of women and men that survived. As for value_counts above, we use the normalize keyword to ask for proportions. This time we have to specify the direction that Pandas should use for the proportion. We could be interested in the proportion across the column (proportions of male and female passengers within the yes “survived” category, likewise for the no category). More likely, in this case, we will be interested in proportions across the row (proportion who survived within male category, proportion who survived within female category). We give Pandas this information with the value for the normalize keyword argument. Pandas uses the term index to refer to the rows. Remember, Pandas also uses the term index for the row labels.

# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the row.
pd.crosstab(df['gender'], df['survived'], normalize='index')
survived no yes
gender
female 0.272109 0.727891
male 0.806995 0.193005

We could also ask for the proportions over the columns.

# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the column.
pd.crosstab(df['gender'], df['survived'], normalize='columns')
survived no yes
gender
female 0.161507 0.682979
male 0.838493 0.317021

Of course, the arguments we are passing to pd.crosstab are Pandas Series. We can pass any Pandas series as our row or column specification.

For example, we might want to make a new Series that categorizes passengers as male, female or child. We can use that Series as an argument to pd.crosstab.

mwc = df['gender'].copy()  # Copy-right!
mwc[df['age'] <= 16] = 'child'
pd.crosstab(mwc, df['survived'], normalize='index')
survived no yes
gender
child 0.467153 0.532847
female 0.257979 0.742021
male 0.831429 0.168571

Three-way tables (or four …)

Another factor that we know was in play was passenger class. For various reasons, first class passengers, in particular, got better access to the lifeboats than passengers in other classes. This adds a further factor that might explain the numbers and proportions that survived.

Now we have two factors (categories) that we think may predict survival - class and gender.

pd.crosstab takes this in its stride. We can just pass a list of Series as our row or column specification.

# Counts for passengers who survived, broken down by class and male, female,
# child.
categories = [df['class'], mwc]
pd.crosstab(categories, df['survived'])
survived no yes
class gender
1st child 1 8
female 4 124
male 109 55
2nd child 1 25
female 11 77
male 136 13
3rd child 62 40
female 82 78
male 337 50

normalize over the rows will give proportions for each sub-category:

# Proportions surviving in sub-categories of class, male/female/child.
pd.crosstab(categories, df['survived'], normalize='index')
survived no yes
class gender
1st child 0.111111 0.888889
female 0.031250 0.968750
male 0.664634 0.335366
2nd child 0.038462 0.961538
female 0.125000 0.875000
male 0.912752 0.087248
3rd child 0.607843 0.392157
female 0.512500 0.487500
male 0.870801 0.129199

You can specify multiple Series for the columns of the table. Here we just run the equivalent table but reversing the rows and columns:

# Proportions surviving in sub-categories of class, male/female/child,
# with 'survived' in the rows, and the subcategories in the columns.
pd.crosstab(df['survived'], categories, normalize='columns')
class 1st 2nd 3rd
gender child female male child female male child female male
survived
no 0.111111 0.03125 0.664634 0.038462 0.125 0.912752 0.607843 0.5125 0.870801
yes 0.888889 0.96875 0.335366 0.961538 0.875 0.087248 0.392157 0.4875 0.129199

You can specify more than two columns in the row or column arguments, or both. Try experimenting with different arguments.