Cross-tabulation¶

import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

It is often useful to make tables that count rows in categories defined by columns.

Here we will use a dataset with data about passengers on the RMS Titanic.

See the Titanic dataset page for some more information about this dataset, and a link to download the data.

# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()

	name	gender	age	class	embarked	country	fare	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	Southampton	United States	7.11	no
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	Southampton	United States	20.05	no
2	Abbott, Mr. Rossmore Edward	male	16.0	3rd	Southampton	United States	20.05	no
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.05	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.13	yes

As you can see, we have various bits of information about the passengers on the Titanic. Several columns give labels - categories. For example, the gender column categorizes the passenger as male or female, and the survived column categorizes the passenger by whether they survived the sinking - yes or no.

A one-way table¶

As you know, we can use the value_counts method of the data frame to see the unique values for any column, with the counts of the number of times each value occurs.

# Counts of each unique value in the 'survived' column.
df['survived'].value_counts()

no     743
yes    470
Name: survived, dtype: int64

Often we want to see proportions instead of counts. We can use the normalize keyword argument for this:

# Proportions of each unique value in the 'survived' column.
df['survived'].value_counts(normalize=True)

no     0.612531
yes    0.387469
Name: survived, dtype: float64

Two-way tables¶

We often want to break down the rows by more than one category. Remember that the Titanic passengers and crew tended to give preference to women and children, when loading the lifeboats. So, we may want to see the counts of passengers who survived, broken down by gender.

This is the role of the Pandas crosstab function. It is a Pandas function because it is function inside the Pandas module; we can get this function with pd.crosstab (assuming we have done the usual import pandas as pd).

The first argument to pd.crosstab is the category we want to see in the rows; the second argument is the category we want to see in the columns.

Here is a cross-tabulation of gender (in the rows) by survived (in the columns):

# Cross-tabulation of counts for 'gender' (rows) by 'survived' (columns).
pd.crosstab(df['gender'], df['survived'])

survived	no	yes
gender
female	120	321
male	623	149

We will often want to see these values as proportions rather than counts. For example, we may be interested in the proportion of women and men that survived. As for value_counts above, we use the normalize keyword to ask for proportions. This time we have to specify the direction that Pandas should use for the proportion. We could be interested in the proportion across the column (proportions of male and female passengers within the yes “survived” category, likewise for the no category). More likely, in this case, we will be interested in proportions across the row (proportion who survived within male category, proportion who survived within female category). We give Pandas this information with the value for the normalize keyword argument. Pandas uses the term index to refer to the rows. Remember, Pandas also uses the term index for the row labels.

# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the row.
pd.crosstab(df['gender'], df['survived'], normalize='index')

survived	no	yes
gender
female	0.272109	0.727891
male	0.806995	0.193005

We could also ask for the proportions over the columns.

# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the column.
pd.crosstab(df['gender'], df['survived'], normalize='columns')

survived	no	yes
gender
female	0.161507	0.682979
male	0.838493	0.317021

Of course, the arguments we are passing to pd.crosstab are Pandas Series. We can pass any Pandas series as our row or column specification.

For example, we might want to make a new Series that categorizes passengers as male, female or child. We can use that Series as an argument to pd.crosstab.

mwc = df['gender'].copy()  # Copy-right!
mwc[df['age'] <= 16] = 'child'
pd.crosstab(mwc, df['survived'], normalize='index')

survived	no	yes
gender
child	0.467153	0.532847
female	0.257979	0.742021
male	0.831429	0.168571

Three-way tables (or four …)¶

Another factor that we know was in play was passenger class. For various reasons, first class passengers, in particular, got better access to the lifeboats than passengers in other classes. This adds a further factor that might explain the numbers and proportions that survived.

Now we have two factors (categories) that we think may predict survival - class and gender.

pd.crosstab takes this in its stride. We can just pass a list of Series as our row or column specification.

# Counts for passengers who survived, broken down by class and male, female,
# child.
categories = [df['class'], mwc]
pd.crosstab(categories, df['survived'])

	survived	no	yes
class	gender
1st	child	1	8
	female	4	124
	male	109	55
2nd	child	1	25
	female	11	77
	male	136	13
3rd	child	62	40
	female	82	78
	male	337	50

normalize over the rows will give proportions for each sub-category:

# Proportions surviving in sub-categories of class, male/female/child.
pd.crosstab(categories, df['survived'], normalize='index')

	survived	no	yes
class	gender
1st	child	0.111111	0.888889
	female	0.031250	0.968750
	male	0.664634	0.335366
2nd	child	0.038462	0.961538
	female	0.125000	0.875000
	male	0.912752	0.087248
3rd	child	0.607843	0.392157
	female	0.512500	0.487500
	male	0.870801	0.129199

You can specify multiple Series for the columns of the table. Here we just run the equivalent table but reversing the rows and columns:

# Proportions surviving in sub-categories of class, male/female/child,
# with 'survived' in the rows, and the subcategories in the columns.
pd.crosstab(df['survived'], categories, normalize='columns')

class	1st			2nd			3rd
gender	child	female	male	child	female	male	child	female	male
survived
no	0.111111	0.03125	0.664634	0.038462	0.125	0.912752	0.607843	0.5125	0.870801
yes	0.888889	0.96875	0.335366	0.961538	0.875	0.087248	0.392157	0.4875	0.129199

You can specify more than two columns in the row or column arguments, or both. Try experimenting with different arguments.

Coding for Data - 2020 edition

Cross-tabulation¶

A one-way table¶

Two-way tables¶

Three-way tables (or four …)¶