General linear model and F tests

Download notebook Interact

import numpy as np  # The array library
import numpy.linalg as npl  # The linear algebra sub-package
# Only show 4 decimals when printing
np.set_printoptions(precision=6)

import matplotlib.pyplot as plt
%matplotlib inline

The problem, again

These two lists of numbers are from the GLM intro notebook:

psychopathy = [11.416,   4.514,  12.204,  14.835,
                8.416,   6.563,  17.343, 13.02,
               15.19 ,  11.902,  22.721,  22.324]

clammy = [0.389,  0.2  ,  0.241,  0.463,
          4.585,  1.097,  1.642,  4.972,
          7.957,  5.585,  5.527,  6.964]

Let’s first check what R gives us for this simple regression model (you will need R installed for this):

# This is code to copy-paste into R
psychopathy <- c(11.416,   4.514,  12.204,  14.835, 8.416,   6.563,  17.343,
                13.02, 15.19 ,  11.902,  22.721,  22.324)
clammy = c(0.389,  0.2  ,  0.241,  0.463, 4.585,  1.097,  1.642,  4.972,
           7.957,  5.585,  5.527,  6.964)

simple_regression = lm(psychopathy ~ clammy)
print(summary(simple_regression))

Here are the utilities to show the design matrix graphically, from the GLM intro:

def scale_design_mtx(X):
    """utility to scale the design matrix for display

    This scales the columns to their own range so we can see the variations
    across the column for all the columns, regardless of the scaling of the
    column.
    """
    mi, ma = X.min(axis=0), X.max(axis=0)
    # Vector that is True for columns where values are not
    # all almost equal to each other
    col_neq = (ma - mi) > 1.e-8
    Xs = np.ones_like(X)
    # Leave columns with same value throughout with 1s
    # Scale other columns to min, max in column
    mi = mi[col_neq]
    ma = ma[col_neq]
    Xs[:,col_neq] = (X[:,col_neq] - mi)/(ma - mi)
    return Xs

def show_design(X, design_title):
    """ Show the design matrix nicely """
    plt.imshow(scale_design_mtx(X),
               interpolation='nearest',
               cmap='gray') # Gray colormap
    plt.title(design_title)

Remember, the first four students were from Berkeley, the second group of four were from Stanford, and the third group were from MIT.

university = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
plt.plot(university, psychopathy, '+')
plt.xlabel('University')
plt.ylabel('Psychopathy')

Text(0, 0.5, 'Psychopathy')

png

$\newcommand{Xmat}{\boldsymbol X}$ The F test needs a full model and a reduced model. Call the full model $\Xmat_f$ and the reduced model $\Xmat_r$. The formula for the F test is:

$F_{\nu_1, \nu_2} = \frac{ (\textrm{SSR}(\Xmat_r) - \textrm{SSR}(\Xmat_f)) / \nu_1} {\textrm{SSR}(\Xmat_f) / \nu_2}$

where $\nu_1$ is called the numerator degrees of freedom, and $\nu_2$ is the denominator degrees of freedom or the degrees of freedom of the error.

$\nu_2$ is also the degrees of freedom for the full model. $\nu_1$ is given by $d - \nu_2$ where $d$ is the degrees of freedom from the reduced model.

First, let’s see what R does:

# R script for one-way ANOVA F test
psychopathy <- c(11.416,   4.514,  12.204,  14.835, 8.416,   6.563,  17.343,
                13.02, 15.19 ,  11.902,  22.721,  22.324)

university <- factor(c(rep('Berkeley', 4),
                       rep('Stanford', 4),
                       rep('MIT', 4)))

one_way_anova = lm(psychopathy ~ university)
print(summary(one_way_anova))

And now, an F test, the traditional way.

This is the standard F test you may be used to, for a one-way analysis of variance.

The reduced model is a model that just has the mean for all the observations, independent of group (university).

The full model allows each group (university) to have its own mean. The F test looks at the reduction in variance when we include the group means into the model.

overall_mean = np.mean(psychopathy)
overall_mean

13.370666666666667

overall_mean_corrected = psychopathy - overall_mean
overall_mean_corrected

array([-1.954667, -8.856667, -1.166667,  1.464333, -4.954667, -6.807667,
        3.972333, -0.350667,  1.819333, -1.468667,  9.350333,  8.953333])

# What do you think this will be?
np.mean(overall_mean_corrected)

0.0

berkeley_mean = np.mean(psychopathy[:4])
stanford_mean = np.mean(psychopathy[4:8])
mit_mean = np.mean(psychopathy[8:])
berkeley_mean, stanford_mean, mit_mean

(10.74225, 11.3355, 18.03425)

# Subtract the matching group mean from each row
group_mean_corrected = np.array(psychopathy).copy()
group_mean_corrected[:4] = group_mean_corrected[:4] - berkeley_mean
group_mean_corrected[4:8] = group_mean_corrected[4:8] - stanford_mean
group_mean_corrected[8:] = group_mean_corrected[8:] - mit_mean
group_mean_corrected

array([ 0.67375, -6.22825,  1.46175,  4.09275, -2.9195 , -4.7725 ,
        6.0075 ,  1.6845 , -2.84425, -6.13225,  4.68675,  4.28975])

SSR_full = np.sum(group_mean_corrected ** 2)
SSR_reduced = np.sum(overall_mean_corrected ** 2)
SSR_full, SSR_reduced

(214.42201450000005, 345.6199626666667)

N = len(psychopathy)
df_error = N - 3
df_extra = 3 - 1
F_top = (SSR_reduced - SSR_full) / df_extra
F_bottom = SSR_full / df_error
F_stat = F_top / F_bottom
F_stat

2.75340555925054

# Back to the old model
X_full = np.zeros((N, 3))
X_full[:4, 0] = 1  # Berkeley indicator is first column
X_full[4:8, 1] = 1  # Stanford indicator is second column
X_full[8:, 2] = 1  # MIT indicator is third column
show_design(X_full, 'Group mean')
X_full

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

png

X_full

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

B_full = npl.pinv(X_full).dot(psychopathy)
B_full

array([10.74225, 11.3355 , 18.03425])

full_fitted = X_full.dot(B_full)
full_residuals = psychopathy - full_fitted
full_residuals

array([ 0.67375, -6.22825,  1.46175,  4.09275, -2.9195 , -4.7725 ,
        6.0075 ,  1.6845 , -2.84425, -6.13225,  4.68675,  4.28975])

# The reduced design - the overall mean
X_reduced = np.ones((N, 1))
X_reduced

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.]])

B_reduced = npl.pinv(X_reduced).dot(psychopathy)
B_reduced

array([13.370667])

reduced_fitted = X_reduced.dot(B_reduced)
reduced_residuals = psychopathy - reduced_fitted
reduced_residuals

array([-1.954667, -8.856667, -1.166667,  1.464333, -4.954667, -6.807667,
        3.972333, -0.350667,  1.819333, -1.468667,  9.350333,  8.953333])

GLM_SSR_full = np.sum(full_residuals ** 2)
GLM_SSR_reduced = np.sum(reduced_residuals ** 2)
GLM_SSR_full, GLM_SSR_reduced

(214.42201450000005, 345.61996266666665)

OK - now the same thing with the original clammy model:

X_regression_full = np.column_stack((np.ones(12), clammy))
show_design(X_regression_full, 'Simple regression')

png

B_regression_full = npl.pinv(X_regression_full).dot(psychopathy)
B_regression_full

array([10.071286,  0.999257])

fitted_regression_full = X_regression_full.dot(B_regression_full)
residuals_regression_full = psychopathy - fitted_regression_full
residuals_regression_full

array([ 0.956003, -5.757137,  1.891893,  4.301058, -6.23688 , -4.604471,
        5.630934, -2.019593, -2.832376, -3.750137,  7.126819,  5.293887])

fitted_regression_full

array([10.459997, 10.271137, 10.312107, 10.533942, 14.65288 , 11.167471,
       11.712066, 15.039593, 18.022376, 15.652137, 15.594181, 17.030113])

10.071286 * 1 + 0.999257 * np.array(clammy)

array([10.459997, 10.271137, 10.312107, 10.533942, 14.652879, 11.167471,
       11.712066, 15.039592, 18.022374, 15.652136, 15.594179, 17.030112])

SSR_regression_full = np.sum(residuals_regression_full ** 2)
F_top = (SSR_reduced - SSR_regression_full) / 1
F_bottom = SSR_regression_full / (N - 2)
F_stat = F_top / F_bottom
F_stat

3.6648861899665146