Categorical Formulae

Example

This is an example with one numeric variable and two categorical variables. It is based on an example from the Stanford stats191 course, though the data here have been modified so that the categorical variables are stored as characters rather than numbers; this just makes R treat them as factors automatically.

> url = 'http://stats191.stanford.edu/data/salary.csv'
> data = read.table(url, header=T, sep=',')

> salary.lm = lm(S ~ X:E + X:P, data=data)
> print(coef(salary.lm))
(Intercept)        X:EB        X:EM        X:EP        X:PM 
 12938.3805    190.3962    347.6023    301.9127    698.2767 

The order of the two terms above matters:

> salary2.lm = lm(S ~ X:P + X:E, data=data)
> print(coef(salary2.lm))
(Intercept)        X:PL        X:PM        X:EM        X:EP 
 12938.3805    190.3962    888.6730    157.2060    111.5165 

Let’s try this with the ANCOVA model. First we do some setup (a note on newer library versions follows this block):

In [1]: import numpy as np

In [2]: # Set precision printing to match R output

In [3]: np.set_printoptions(precision=4)

In [4]: import matplotlib.mlab as ML

In [5]: from pkg_resources import resource_stream

In [6]: salary = ML.csv2rec(resource_stream("formula", "data/salary.csv"))

In [7]: from formula.parts import fromrec

In [8]: from formula.ancova import *
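
Note that matplotlib.mlab.csv2rec, pkg_resources.resource_stream and scikits.statsmodels come from older versions of those libraries. As a rough sketch only (assuming the CSV has been saved locally as salary.csv and that its header uses the upper-case names S, X, E, P seen in the R example), the same record array and OLS class could be obtained in a newer environment as follows; the formula package calls below are unchanged:

import numpy as np

# Read the CSV into a structured (record) array; genfromtxt keeps string
# columns as strings, so they can later be treated as factors.
salary = np.genfromtxt('salary.csv', delimiter=',', names=True,
                       dtype=None, encoding=None)

# csv2rec lower-cased the field names; do the same so that salary['s'],
# terms['x'] and so on keep working.
salary.dtype.names = tuple(name.lower() for name in salary.dtype.names)

# OLS now lives in statsmodels rather than scikits.statsmodels.
from statsmodels.api import OLS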

Next we set up the model:

In [9]: terms = fromrec(salary)

In [10]: x = terms['x']; e = terms['e']; p = terms['p']

In [11]: ancova = ANCOVA((x,e),(x,p))

In [12]: formula = ancova.formula

In [13]: print formula
Formula([1, e_B*x, e_M*x, e_P*x, p_M*x])

To fit the model, we use scikits.statsmodels:

In [14]: from scikits.statsmodels.api import OLS

In [15]: model = OLS(salary['s'], formula.design(salary, return_float=True))

In [16]: results = model.fit()

In [17]: print results.params
[ 12938.3805    190.3962    347.6023    301.9127    698.2767]

Interchanging the order above:

In [18]: ancova2 = ANCOVA((x,p),(x,e))

In [19]: ancova2.formula.terms
Out[19]: array([1, p_L*x, p_M*x, e_M*x, e_P*x], dtype=object)

In [20]: f2 = ancova2.formula

In [21]: model = OLS(salary['s'], f2.design(salary, return_float=True))

In [22]: results = model.fit()

In [23]: results.params
Out[23]: array([ 12938.3805,    190.3962,    888.673 ,    157.206 ,    111.5165])

The difference in the order has to do with how R constructs a design matrix from a set of factors (and the corresponding numeric variable).

Factor Codings

The difference manifests itself in how each factor multiplying X above is coded in each product. A factor appearing in a product of factors can be coded either as an indicator, which includes a dummy variable for every level of the factor, or as a contrast, which includes one fewer column. The default behavior of R is to drop the first level in a sorted list of the factor's levels, and ANCOVA follows the same behavior by default.

For instance:

In [24]: ancova.codings
Out[24]: {1: {}, x: {('e',): [('e', 'indicator')], ('p',): [('p', 'contrast')]}}

In [25]: ancova2.codings
Out[25]: {1: {}, x: {('e',): [('e', 'contrast')], ('p',): [('p', 'indicator')]}}

In the first formula, P appears as a contrast while E appears as an indicator. In the second formula, E appears as a contrast and P as an indicator.
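
To make the two codings concrete, here is a minimal numpy sketch (independent of the formula package) of indicator and drop-first-level contrast coding for a factor with levels B, M and P:

import numpy as np

# Observed values of a factor e with levels B, M, P.
e = np.array(['B', 'M', 'P', 'M', 'B'])
levels = ['B', 'M', 'P']          # sorted list of levels

# Indicator coding: one dummy column per level (e_B, e_M, e_P).
indicator = np.column_stack([(e == lev).astype(float) for lev in levels])

# Contrast coding ("drop first level"): one column fewer (e_M, e_P),
# with the first sorted level, B, acting as the reference.
contrast = indicator[:, 1:]

print(indicator.shape, contrast.shape)   # (5, 3) (5, 2)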

Two-way interactions

We can also specify two-way interactions in R:

> print(coef(lm(S ~ X:P:E, data=data)))
(Intercept)     X:PL:EB     X:PM:EB     X:PL:EM     X:PM:EM     X:PL:EP 
 13454.2368    189.4646    498.6643    275.1639   1046.6828   -124.0655 
    X:PM:EP 
   971.4152 

We can do the same with ANCOVA:

In [26]: ancova3 = ANCOVA((x,(p,e)))

In [27]: ancova3.codings
Out[27]: {1: {}, x: {('e', 'p'): [('e', 'indicator'), ('p', 'indicator')]}}

In [28]: f3 = ancova3.formula

In [29]: model = OLS(salary['s'], f3.design(salary, return_float=True))

In [30]: results = model.fit()

In [31]: results.params
Out[31]: 
array([ 13454.2368,    189.4646,    498.6643,    275.1639,   1046.6828,
         -124.0655,    971.4152])

If we add in some parts of the formula, it becomes a little harder to predict:

> print(coef(lm(S ~ X:E:P + X:P + X:E, data=data)))
(Intercept)        X:PL        X:PM        X:EM        X:EP     X:EM:PM 
13454.23676   189.46457   498.66431    85.69928  -313.53006   462.31922 
    X:EP:PM 
  786.28093 

With ANCOVA:

In [32]: ancova4 = ANCOVA((x,(p,e)),(x,p),(x,e))

In [33]: ancova4.codings
Out[33]: 
{1: {},
 x: {('e',): [('e', 'contrast')],
     ('e', 'p'): [('e', 'contrast'), ('p', 'contrast')],
     ('p',): [('p', 'indicator')]}}

In [34]: ancova4.formula.terms
Out[34]: array([1, e_M*x, e_P*x, p_L*x, p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)

In [35]: model = OLS(salary['s'], ancova4.formula.design(salary, return_float=True))

In [36]: results = model.fit()

In [37]: results.params
Out[37]: 
array([ 13454.2368,     85.6993,   -313.5301,    189.4646,    498.6643,
          462.3192,    786.2809])

Changing the order above again changes the terms in the formula:

In [38]: ancova5 = ANCOVA((x,(p,e)),(x,e),(x,p))

In [39]: ancova5.codings
Out[39]: 
{1: {},
 x: {('e',): [('e', 'indicator')],
     ('e', 'p'): [('e', 'contrast'), ('p', 'contrast')],
     ('p',): [('p', 'contrast')]}}

In [40]: ancova.formula.terms
Out[40]: array([1, e_B*x, e_M*x, e_P*x, p_M*x], dtype=object)

In [41]: model = OLS(salary['s'], ancova5.formula.design(salary, return_float=True))

In [42]: results = model.fit()

In [43]: results.params
Out[43]: 
array([ 13454.2368,    309.1997,    189.4646,    275.1639,   -124.0655,
          462.3192,    786.2809])

as it does in R:

> print(coef(lm(S ~ X:E:P + X:E + X:P, data=data)))
(Intercept)        X:EB        X:EM        X:EP        X:PM     X:EM:PM 
 13454.2368    189.4646    275.1639   -124.0655    309.1997    462.3192 
    X:EP:PM 
   786.2809 

What is important is the graded order. That is, for the numeric variable X, the first-order factors of ancova4 are ordered as [set([P]), set([E])] and its second-order factors are [set([P, E])], while it has no zeroth-order factors. The only difference between ancova4 and ancova5 is the order of their first-order factors.
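
As a purely schematic sketch of this grading (not the library's implementation), the factor tuples attached to a numeric term can be grouped by how many factors they contain, preserving the order in which they appear:

def graded_order(spec, numeric):
    """Group the factor tuples attached to `numeric` by their order,
    keeping the order in which they appear in the specification."""
    graded = {}
    for expr, factors in spec:
        if expr == numeric:
            graded.setdefault(len(factors), []).append(list(factors))
    return graded

# ancova4 = ANCOVA((x,(p,e)), (x,p), (x,e))
spec4 = [('x', ('p', 'e')), ('x', ('p',)), ('x', ('e',))]
# ancova5 = ANCOVA((x,(p,e)), (x,e), (x,p))
spec5 = [('x', ('p', 'e')), ('x', ('e',)), ('x', ('p',))]

print(graded_order(spec4, 'x'))   # {2: [['p', 'e']], 1: [['p'], ['e']]}
print(graded_order(spec5, 'x'))   # {2: [['p', 'e']], 1: [['e'], ['p']]}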

Adding X to the R formula adds a zeroth-order factor:

> print(coef(lm(S ~ X + X:E:P + X:E + X:P, data=data)))
(Intercept)           X        X:EM        X:EP        X:PM     X:EM:PM 
13454.23676   189.46457    85.69928  -313.53006   309.19974   462.31922 
    X:EP:PM 
  786.28093 

With the categorical formula, this can be achieved by:

In [44]: ancova6 = ANCOVA(x,(x,e),(x,p),(x,(p,e)))

In [45]: ancova6.codings
Out[45]: 
{1: {},
 x: {(): [],
     ('e',): [('e', 'contrast')],
     ('e', 'p'): [('e', 'contrast'), ('p', 'contrast')],
     ('p',): [('p', 'contrast')]}}

In [46]: ancova6.formula.terms
Out[46]: array([1, x, p_M*x, e_M*x, e_P*x, e_M*p_M*x, e_P*p_M*x], dtype=object)

In [47]: model = OLS(salary['s'], ancova6.formula.design(salary, return_float=True))

In [48]: results = model.fit()

In [49]: results.params
Out[49]: 
array([ 13454.2368,    189.4646,    309.1997,     85.6993,   -313.5301,
          462.3192,    786.2809])

One more example:

> print(coef(lm(S ~ X:E:P + X:E, data=data)))
(Intercept)        X:EB        X:EM        X:EP     X:EB:PM     X:EM:PM 
 13454.2368    189.4646    275.1639   -124.0655    309.1997    771.5190 
    X:EP:PM 
  1095.4807 

The ANCOVA equivalent:

In [50]: ancova6a = ANCOVA((x,(e,p)),(x,e))

In [51]: ancova6a.codings
Out[51]: 
{1: {},
 x: {('e',): [('e', 'indicator')],
     ('e', 'p'): [('e', 'indicator'), ('p', 'contrast')]}}

In [52]: ancova6a.formula.terms
Out[52]: array([1, e_B*x, e_M*x, e_P*x, e_B*p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)

In [53]: model = OLS(salary['s'], ancova6a.formula.design(salary, return_float=True))

In [54]: results = model.fit()

In [55]: results.params
Out[55]: 
array([ 13454.2368,    189.4646,    275.1639,   -124.0655,    309.1997,
          771.519 ,   1095.4807])

Intercept

The ubiquitous intercept can be suppressed by passing the keyword argument add_intercept=False to the ANCOVA constructor:

In [56]: ancova7 = ANCOVA(x,(x,(p,e)),(x,e),(x,p), add_intercept=False)

In [57]: ancova7.formula.terms
Out[57]: array([x, e_M*x, e_P*x, p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)

In [58]: model = OLS(salary['s'], ancova7.formula.design(salary, return_float=True))

In [59]: results = model.fit()

In [60]: results.params
Out[60]: 
array([ 1227.9563,   203.7999,  2369.3629,  1452.4762,  -656.674 ,
       -2744.2625])

In R the intercept can be removed (most of the time) by appending -1 to the string specifying the formula:

> print(coef(lm(S ~ X + X:P:E + X:E + X:P - 1, data=data)))
        X      X:EB      X:EM      X:EP      X:PM   X:PM:EM   X:PM:EP 
 3597.319 -2369.363 -2165.563        NA  1452.476  -656.674 -2744.263 

This design matrix is not the same as the one obtained from ANCOVA, so the coefficients are also different. The discrepancy is related to R treating factors and numeric variables as equal, whereas the ANCOVA module makes a distinction between the two. The reason R has a missing value in the coefficients is that its rules for generating design matrices coded E with indicators in the term X:E, which leads to a linear dependence with the X column already in the model. The ANCOVA implementation instead treats X as (X, 1), and hence when (X, E) is to be added it sees that there would be a linear dependence if E were added with indicator functions (a small numerical illustration of this dependence follows the next example). Effectively, all columns involving X are the product of X with the columns of a purely categorical formula. In this case, the columns are the same as

In [61]: ancova7a = ANCOVA((1,(p,e)), (1,e), (1,p))

In [62]: ancova7a.formula.terms
Out[62]: array([1, p_M, e_M, e_P, e_M*p_M, e_P*p_M], dtype=object)

In [63]: ancova7a.formula.terms * x
Out[63]: array([x, p_M*x, e_M*x, e_P*x, e_M*p_M*x, e_P*p_M*x], dtype=object)

In [64]: ancova7.formula.terms
Out[64]: array([x, e_M*x, e_P*x, p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
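
To see the linear dependence concretely, here is a minimal numpy sketch (independent of the formula package): a column x together with all three indicator-coded products e_B*x, e_M*x, e_P*x cannot have full column rank, because e_B + e_M + e_P = 1 and hence e_B*x + e_M*x + e_P*x = x:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=9)
e = np.array(list('BMP' * 3))          # every level observed

# Indicator columns for e, each multiplied by x.
ex = np.column_stack([(e == lev) * x for lev in ['B', 'M', 'P']])

# Adding the column x itself makes the design rank deficient.
design = np.column_stack([x, ex])
print(design.shape[1], np.linalg.matrix_rank(design))   # 4 columns, rank 3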

This is how the ANCOVA is constructed: for each numeric term there is a corresponding purely categorical formula. For example:

In [65]: z = Term('z')

In [66]: ancova7b = ANCOVA((1,e), (z,e), (z,(e,p)), (x*z,e), (x,e), (x,p), x*z)

In [67]: ancova7b.sequence(x)
Out[67]: [(x, (Factor('e', ['B', 'M', 'P']),)), (x, (Factor('p', ['L', 'M']),))]

In [68]: ancova7b.sequence(z*x)
Out[68]: [(x*z, ()), (x*z, (Factor('e', ['B', 'M', 'P']),))]

In [69]: ancova7b.sequence(1)
Out[69]: [(1, ()), (1, (Factor('e', ['B', 'M', 'P']),))]

In [70]: ancova7b.sequence(z)
Out[70]: 
[(z, (Factor('e', ['B', 'M', 'P']),)),
 (z, (Factor('e', ['B', 'M', 'P']), Factor('p', ['L', 'M'])))]

Any of the sequences above can be used to create a new ANCOVA instance whose formula is that numeric expression multiplied by the corresponding purely categorical formula.

In [71]: ANCOVA(*ancova7b.sequence(z)).formula.terms
Out[71]: array([1, e_B*z, e_M*z, e_P*z, e_B*p_M*z, e_M*p_M*z, e_P*p_M*z], dtype=object)

In [72]: purely_categorical = ANCOVA(*[(1, factors) for _, factors in ancova7b.sequence(z)])

In [73]: purely_categorical.formula.terms
Out[73]: array([1, e_M, e_P, e_B*p_M, e_M*p_M, e_P*p_M], dtype=object)

In [74]: purely_categorical.formula.terms * z
Out[74]: array([z, e_M*z, e_P*z, e_B*p_M*z, e_M*p_M*z, e_P*p_M*z], dtype=object)

Contrasts

Each (expr, factor) pair in the ANCOVA specification maps to a specific contrast:

In [75]: ancova7.contrasts
Out[75]: 
{'I(x)': Formula([x]),
 'I(x):e': Formula([e_M*x, e_P*x]),
 'I(x):e:p': Formula([e_M*p_M*x, e_P*p_M*x]),
 'I(x):p': Formula([p_M*x])}

As opposed to the contrasts of ancova3:

In [76]: ancova3.contrasts
Out[76]: 
{'1': Formula([1]),
 'I(x):e:p': Formula([e_B*p_L*x, e_B*p_M*x, e_M*p_L*x, e_M*p_M*x, e_P*p_L*x, e_P*p_M*x])}

These are the default contrasts, which drop the first level of each factor. This can be changed with the default_contrast keyword argument:

In [77]: ancova8 = ANCOVA(x,(x,(p,e)),(x,e),(x,p), default_contrast='main_effect')

In [78]: ancova8.contrasts
Out[78]: 
{'1': Formula([1]),
 'I(x)': Formula([x]),
 'I(x):e': Formula([x*(-e_B + e_M), x*(-e_B + e_P)]),
 'I(x):e:p': Formula([x*(-e_B + e_M)*(-p_L + p_M), x*(-e_B + e_P)*(-p_L + p_M)]),
 'I(x):p': Formula([x*(-p_L + p_M)])}

Contrast Matrices & Slices

Each contrast is associated with a set of columns of the final design matrix; the corresponding terms are also elements of the formula attribute:

In [79]: ancova3.slices
Out[79]: {'1': slice(0, 1, None), 'I(x):e:p': slice(1, 7, None)}

The slices can be interpreted as contrast matrices, as sketched after the output below:

In [80]: ancova3.contrast_matrices
Out[80]: 
{'1': array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.]]),
 'I(x):e:p': array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.]])}
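
The relationship between the two is straightforward: under this coding, a slice over the design's columns simply picks out rows of an identity matrix. A minimal sketch, assuming the 7-column design of ancova3 above:

import numpy as np

ncol = 7                                     # columns in ancova3's design
slices = {'1': slice(0, 1), 'I(x):e:p': slice(1, 7)}

# Each slice selects rows of the identity, giving the contrast matrix.
contrast_matrices = {name: np.identity(ncol)[s] for name, s in slices.items()}

print(contrast_matrices['1'])                # 1 x 7, picks out the intercept
print(contrast_matrices['I(x):e:p'].shape)   # (6, 7)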

Note, however, that these contrast matrices depend on the default_coding argument. Generally speaking, they are appropriate when the default coding is “main_effect” rather than “drop_reference”. TODO: construct these properly for different default codings.

Further, not all of these contrasts are estimable; whether they are depends on the actual design matrix used to fit the OLS model, so users should keep this in mind. In this example, the contrast I(X):E:P would not be estimable if, say, we never observed a laborer with a PhD. A simple rank-based check is sketched below.
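
A standard way to check estimability is to test whether each contrast row lies in the row space of the design matrix, for example by comparing matrix ranks. A minimal sketch (the function name and tolerance are illustrative, not part of the formula API):

import numpy as np

def is_estimable(contrast, design, tol=1e-8):
    """A contrast (row or matrix) is estimable iff its rows lie in the
    row space of the design, i.e. appending it does not raise the rank."""
    contrast = np.atleast_2d(contrast)
    full = np.vstack([design, contrast])
    return np.linalg.matrix_rank(full, tol=tol) == np.linalg.matrix_rank(design, tol=tol)

# Example: a design whose third column is identically zero (a combination of
# levels that was never observed) cannot identify that coefficient.
X = np.column_stack([np.ones(5), np.arange(5.0), np.zeros(5)])
print(is_estimable([0.0, 0.0, 1.0], X))   # False
print(is_estimable([0.0, 1.0, 0.0], X))   # True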

Sums of squares

The typeI, typeII and typeIII functions compute the ANOVA table for the corresponding type of sums of squares from a response name, an ANCOVA specification and a record array:

In [81]: ancova = ANCOVA((x,e),(x,p),(x,(p,e)))

In [82]: print ML.rec2txt(typeI('s', ancova, salary))
contrast                 SS       df                MS          F   p_value
1           13719944261.761    1.000   13719944261.761   2934.223     0.000
I(x):e        498268906.746    3.000     166089635.582     35.521     0.000
I(x):p        295195692.259    1.000     295195692.259     63.132     0.000
I(x):e:p       25275391.059    2.000      12637695.530      2.703     0.080
Residuals     182357587.174   39.000       4675835.569        nan       nan

Compare this to the R output:

> anova(lm(S ~ X:E + X:P + X:P:E, data=data))
Analysis of Variance Table

Response: S
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
X:E        3 498268907 166089636 35.5208 3.077e-11 ***
X:P        1 295195692 295195692 63.1322 1.119e-09 ***
X:E:P      2  25275391  12637696  2.7028   0.07957 .  
Residuals 39 182357587   4675836                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
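
Type I sums of squares are sequential: each term's SS is the drop in residual sum of squares when its block of design columns is appended after the blocks of all preceding terms. A minimal numpy sketch of that computation (a generic illustration, not the typeI implementation above; building the blocks from a real design is left to the reader):

import numpy as np

def rss(y, X):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

def sequential_ss(y, blocks):
    """Type I (sequential) sums of squares for an ordered list of
    (name, columns) blocks of the design matrix."""
    ss, X, prev = {}, np.empty((len(y), 0)), y @ y
    for name, cols in blocks:
        X = np.hstack([X, cols])
        cur = rss(y, X)
        ss[name] = prev - cur
        prev = cur
    ss['Residuals'] = prev
    return ss

# Tiny synthetic check with an intercept block and one covariate block.
rng = np.random.default_rng(0)
y = rng.normal(size=20)
blocks = [('1', np.ones((20, 1))), ('x', rng.normal(size=(20, 1)))]
print(sequential_ss(y, blocks))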

For type II, where each term is adjusted for the other terms that do not contain it:

In [83]: print ML.rec2txt(typeII('s', ancova, salary))
contrast                SS       df               MS         F   p_value
1           2160721082.840    1.000   2160721082.840   462.104     0.000
I(x):e        96080370.831    3.000     32026790.277     6.849     0.001
I(x):p       295195692.259    1.000    295195692.259    63.132     0.000
I(x):e:p      25275391.059    2.000     12637695.530     2.703     0.080
Residuals    182357587.174   39.000      4675835.569       nan       nan

The corresponding R output:

> library(car)
> Anova(lm(S ~ X:E + X:P + X:P:E, data=data), type='II')
Anova Table (Type II tests)

Response: S
             Sum Sq Df F value    Pr(>F)    
X:E        96080371  3  6.8494 0.0008116 ***
X:P       295195692  1 63.1322 1.119e-09 ***
X:E:P      25275391  2  2.7028 0.0795676 .  
Residuals 182357587 39                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And type III, where each term is adjusted for all other terms in the model:

In [84]: print ML.rec2txt(typeIII('s', ancova, salary))
contrast                SS       df               MS         F   p_value
1           2160721082.840    1.000   2160721082.840   462.104     0.000
I(x):e        70707719.836    3.000     23569239.945     5.041     0.005
I(x):p        11631680.248    1.000     11631680.248     2.488     0.123
I(x):e:p      25275391.059    2.000     12637695.530     2.703     0.080
Residuals    182357587.174   39.000      4675835.569       nan       nan

And in R:

> library(car)
> Anova(lm(S ~ X:E + X:P + X:P:E, data=data), type='III')
Anova Table (Type III tests)

Response: S
                Sum Sq Df  F value    Pr(>F)    
(Intercept) 2160721083  1 462.1037 < 2.2e-16 ***
X:E           70707720  3   5.0406  0.004775 ** 
X:P           11631680  1   2.4876  0.122823    
X:E:P         25275391  2   2.7028  0.079568 .  
Residuals    182357587 39                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Reversing the order changes the ANOVA tables, in particular the degrees of freedom associated with each contrast. This is because the codings change when the order of the factors changes.

In [85]: ancova2 = ANCOVA((x,p),(x,e), (x,(p,e)))

In [86]: print ML.rec2txt(typeII('s', ancova2, salary))
contrast                SS       df               MS         F   p_value
1           2160721082.840    1.000   2160721082.840   462.104     0.000
I(x):p       344134184.851    2.000    172067092.425    36.799     0.000
I(x):e        17420924.250    2.000      8710462.125     1.863     0.169
I(x):e:p      25275391.059    2.000     12637695.530     2.703     0.080
Residuals    182357587.174   39.000      4675835.569       nan       nan

Compared with R:

> library(car)
> Anova(lm(S ~ X:P + X:E + X:P:E, data=data), type='II')
Anova Table (Type II tests)

Response: S
             Sum Sq Df F value    Pr(>F)    
X:P       344134185  2 36.7992 1.049e-09 ***
X:E        17420924  2  1.8629   0.16878    
X:P:E      25275391  2  2.7028   0.07957 .  
Residuals 182357587 39                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And type III:

In [87]: print ML.rec2txt(typeIII('s', ancova2, salary))
contrast                SS       df               MS         F   p_value
1           2160721082.840    1.000   2160721082.840   462.104     0.000
I(x):p        41682012.745    2.000     20841006.372     4.457     0.018
I(x):e         9511377.475    2.000      4755688.737     1.017     0.371
I(x):e:p      25275391.059    2.000     12637695.530     2.703     0.080
Residuals    182357587.174   39.000      4675835.569       nan       nan

And in R:

> library(car)
> Anova(lm(S ~ X:P + X:E + X:P:E, data=data), type='III')
Anova Table (Type III tests)

Response: S
                Sum Sq Df  F value  Pr(>F)    
(Intercept) 2160721083  1 462.1037 < 2e-16 ***
X:P           41682013  2   4.4572 0.01806 *  
X:E            9511377  2   1.0171 0.37104    
X:P:E         25275391  2   2.7028 0.07957 .  
Residuals    182357587 39                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1