This is an example with one numeric variable and two categorical variables. The data can be found at the URL below; for this example they have been modified to be characters instead of numeric, which simply makes R automatically treat some columns as factors.
> url = 'http://stats191.stanford.edu/data/salary.csv'
> data = read.table(url, header=T, sep=',')
> salary.lm = lm(S ~ X:E + X:P, data=data)
> print(coef(salary.lm))
(Intercept) X:EB X:EM X:EP X:PM
12938.3805 190.3962 347.6023 301.9127 698.2767
The order of the two terms above matters:
> salary2.lm = lm(S ~ X:P + X:E, data=data)
> print(coef(salary2.lm))
(Intercept) X:PL X:PM X:EM X:EP
12938.3805 190.3962 888.6730 157.2060 111.5165
Let’s try this with the ANCOVA model. First we do some set up:
In [1]: import numpy as np
In [2]: # Set precision printing to match R output
In [3]: np.set_printoptions(precision=4)
In [4]: import matplotlib.mlab as ML
In [5]: from pkg_resources import resource_stream
In [6]: salary = ML.csv2rec(resource_stream("formula", "data/salary.csv"))
In [7]: from formula.parts import fromrec
In [8]: from formula.ancova import *
Next we set up the model:
In [9]: terms = fromrec(salary)
In [10]: x = terms['x']; e = terms['e']; p = terms['p']
In [11]: ancova = ANCOVA((x,e),(x,p))
In [12]: formula = ancova.formula
In [13]: print formula
Formula([1, e_B*x, e_M*x, e_P*x, p_M*x])
To fit the model, we use scikits.statsmodels:
In [14]: from scikits.statsmodels.api import OLS
In [15]: model = OLS(salary['s'], formula.design(salary, return_float=True))
In [16]: results = model.fit()
In [17]: print results.params
[ 12938.3805 190.3962 347.6023 301.9127 698.2767]
Interchanging the order above:
In [18]: ancova2 = ANCOVA((x,p),(x,e))
In [19]: ancova2.formula.terms
Out[19]: array([1, p_L*x, p_M*x, e_M*x, e_P*x], dtype=object)
In [20]: f2 = ancova2.formula
In [21]: model = OLS(salary['s'], f2.design(salary, return_float=True))
In [22]: results = model.fit()
In [23]: results.params
Out[23]: array([ 12938.3805, 190.3962, 888.673 , 157.206 , 111.5165])
The difference in the order has to do with how R constructs a design matrix from a set of factors (and the corresponding numeric variable).
The difference manifests itself in how each factor multiplying X above is coded in each product. A factor appearing in a product of factors can be coded either as an indicator, which includes all relevant dummy variables for the factor, or as a contrast, which includes one less column. The default behavior of R, which ANCOVA follows, is to drop a column: R drops the first level in a sorted list of the levels of the factor, and ANCOVA does the same by default.
For instance:
In [24]: ancova.codings
Out[24]: {1: {}, x: {('e',): [('e', 'indicator')], ('p',): [('p', 'contrast')]}}
In [25]: ancova2.codings
Out[25]: {1: {}, x: {('e',): [('e', 'contrast')], ('p',): [('p', 'indicator')]}}
In the first formula, P appears as a contrast while E appears as an indicator. In the second formula, E appears as a contrast and P as an indicator.
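As a side note, the two codings are easy to build by hand. The sketch below uses plain numpy, independent of the formula package; the helper names indicator_coding and contrast_coding are ours, chosen for illustration:

```python
import numpy as np

def indicator_coding(levels, values):
    # One column per level: column j is 1.0 where values equals levels[j].
    values = np.asarray(values)
    return np.column_stack([(values == lev).astype(float) for lev in levels])

def contrast_coding(levels, values):
    # Drop the first level of the sorted list, keeping one fewer column.
    return indicator_coding(levels, values)[:, 1:]

e = ['B', 'B', 'M', 'P', 'M']        # a small factor like E
levels = sorted(set(e))              # ['B', 'M', 'P']

print(indicator_coding(levels, e))   # 3 columns: e_B, e_M, e_P
print(contrast_coding(levels, e))    # 2 columns: e_M, e_P (reference level B dropped)
```

Multiplying each of these columns elementwise by X yields the X:E columns of the design matrices above.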
We can also specify two-way interactions in R:
> print(coef(lm(S ~ X:P:E, data=data)))
(Intercept) X:PL:EB X:PM:EB X:PL:EM X:PM:EM X:PL:EP
13454.2368 189.4646 498.6643 275.1639 1046.6828 -124.0655
X:PM:EP
971.4152
as well as in the ANCOVA:
In [26]: ancova3 = ANCOVA((x,(p,e)))
In [27]: ancova3.codings
Out[27]: {1: {}, x: {('e', 'p'): [('e', 'indicator'), ('p', 'indicator')]}}
In [28]: f3 = ancova3.formula
In [29]: model = OLS(salary['s'], f3.design(salary, return_float=True))
In [30]: results = model.fit()
In [31]: results.params
Out[31]:
array([ 13454.2368, 189.4646, 498.6643, 275.1639, 1046.6828,
-124.0655, 971.4152])
If we add lower-order terms to the formula, the coding becomes a little harder to predict:
> print(coef(lm(S ~ X:E:P + X:P + X:E, data=data)))
(Intercept) X:PL X:PM X:EM X:EP X:EM:PM
13454.23676 189.46457 498.66431 85.69928 -313.53006 462.31922
X:EP:PM
786.28093
In [32]: ancova4 = ANCOVA((x,(p,e)),(x,p),(x,e))
In [33]: ancova4.codings
Out[33]:
{1: {},
x: {('e',): [('e', 'contrast')],
('e', 'p'): [('e', 'contrast'), ('p', 'contrast')],
('p',): [('p', 'indicator')]}}
In [34]: ancova4.formula.terms
Out[34]: array([1, e_M*x, e_P*x, p_L*x, p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
In [35]: model = OLS(salary['s'], ancova4.formula.design(salary, return_float=True))
In [36]: results = model.fit()
In [37]: results.params
Out[37]:
array([ 13454.2368, 85.6993, -313.5301, 189.4646, 498.6643,
462.3192, 786.2809])
Changing the order above again changes the terms in the formula:
In [38]: ancova5 = ANCOVA((x,(p,e)),(x,e),(x,p))
In [39]: ancova5.codings
Out[39]:
{1: {},
x: {('e',): [('e', 'indicator')],
('e', 'p'): [('e', 'contrast'), ('p', 'contrast')],
('p',): [('p', 'contrast')]}}
In [40]: ancova.formula.terms
Out[40]: array([1, e_B*x, e_M*x, e_P*x, p_M*x], dtype=object)
In [41]: model = OLS(salary['s'], ancova5.formula.design(salary, return_float=True))
In [42]: results = model.fit()
In [43]: results.params
Out[43]:
array([ 13454.2368, 309.1997, 189.4646, 275.1639, -124.0655,
462.3192, 786.2809])
as it does in R:
> print(coef(lm(S ~ X:E:P + X:E + X:P, data=data)))
(Intercept) X:EB X:EM X:EP X:PM X:EM:PM
13454.2368 189.4646 275.1639 -124.0655 309.1997 462.3192
X:EP:PM
786.2809
What is important is the graded order. That is, for the numeric variable X, the first-order factors of ancova4 are ordered as [set([P]), set([E])] and its second-order factors are [set([P,E])], while it has no zeroth-order factors. The only difference between ancova4 and ancova5 is the order of their first-order factors.
Adding X to the R formula adds a zeroth-order factor.
> print(coef(lm(S ~ X + X:E:P + X:E + X:P, data=data)))
(Intercept) X X:EM X:EP X:PM X:EM:PM
13454.23676 189.46457 85.69928 -313.53006 309.19974 462.31922
X:EP:PM
786.28093
With the categorical formula, this can be achieved by:
In [44]: ancova6 = ANCOVA(x,(x,e),(x,p),(x,(p,e)))
In [45]: ancova6.codings
Out[45]:
{1: {},
x: {(): [],
('e',): [('e', 'contrast')],
('e', 'p'): [('e', 'contrast'), ('p', 'contrast')],
('p',): [('p', 'contrast')]}}
In [46]: ancova6.formula.terms
Out[46]: array([1, x, p_M*x, e_M*x, e_P*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
In [47]: model = OLS(salary['s'], ancova6.formula.design(salary, return_float=True))
In [48]: results = model.fit()
In [49]: results.params
Out[49]:
array([ 13454.2368, 189.4646, 309.1997, 85.6993, -313.5301,
462.3192, 786.2809])
One more example:
> print(coef(lm(S ~ X:E:P + X:E, data=data)))
(Intercept) X:EB X:EM X:EP X:EB:PM X:EM:PM
13454.2368 189.4646 275.1639 -124.0655 309.1997 771.5190
X:EP:PM
1095.4807
In [50]: ancova6a = ANCOVA((x,(e,p)),(x,e))
In [51]: ancova6a.codings
Out[51]:
{1: {},
x: {('e',): [('e', 'indicator')],
('e', 'p'): [('e', 'indicator'), ('p', 'contrast')]}}
In [52]: ancova6a.formula.terms
Out[52]: array([1, e_B*x, e_M*x, e_P*x, e_B*p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
In [53]: model = OLS(salary['s'], ancova6a.formula.design(salary, return_float=True))
In [54]: results = model.fit()
In [55]: results.params
Out[55]:
array([ 13454.2368, 189.4646, 275.1639, -124.0655, 309.1997,
771.519 , 1095.4807])
The ubiquitous intercept can be suppressed using the keyword argument add_intercept to the constructor of ANCOVA:
In [56]: ancova7 = ANCOVA(x,(x,(p,e)),(x,e),(x,p), add_intercept=False)
In [57]: ancova7.formula.terms
Out[57]: array([x, e_M*x, e_P*x, p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
In [58]: model = OLS(salary['s'], ancova7.formula.design(salary, return_float=True))
In [59]: results = model.fit()
In [60]: results.params
Out[60]:
array([ 1227.9563, 203.7999, 2369.3629, 1452.4762, -656.674 ,
-2744.2625])
In R the intercept can be removed (most of the time) by appending -1 to the string specifying the formula:
> print(coef(lm(S ~ X + X:P:E + X:E + X:P - 1, data=data)))
X X:EB X:EM X:EP X:PM X:PM:EM X:PM:EP
3597.319 -2369.363 -2165.563 NA 1452.476 -656.674 -2744.263
This design matrix is not the same as the one obtained from ANCOVA, hence the coefficients are also different. This is related to R's treatment of factors and numeric variables as interchangeable, whereas the ANCOVA module makes a distinction between the two. The reason R has a missing value in the coefficients is that its rules for generating design matrices coded E with indicators in the term X:E, which leads to a linear dependence with X already in the model. The ANCOVA implementation treats X as (X, 1), and hence when (X, E) is to be added it sees that there would be a linear dependence if E were added with indicator functions. Effectively, all columns involving X are the product of X with the columns of a purely categorical formula. In this case, the columns are the same as:
In [61]: ancova7a = ANCOVA((1,(p,e)), (1,e), (1,p))
In [62]: ancova7a.formula.terms
Out[62]: array([1, p_M, e_M, e_P, e_M*p_M, e_P*p_M], dtype=object)
In [63]: ancova7a.formula.terms * x
Out[63]: array([x, p_M*x, e_M*x, e_P*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
In [64]: ancova7.formula.terms
Out[64]: array([x, e_M*x, e_P*x, p_M*x, e_M*p_M*x, e_P*p_M*x], dtype=object)
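The linear dependence R runs into here can be checked directly with numpy. Below is a small self-contained sketch on synthetic data (not the salary data): with X in the design, the indicator-coded X:E columns sum to X, so the design is rank deficient, while contrast coding keeps full column rank.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(1, 10, 30)                 # a numeric variable like X
e = np.array(['B', 'M', 'P'] * 10)         # a factor like E, all levels observed

def dummies(values, levels):
    return np.column_stack([(values == lev).astype(float) for lev in levels])

E_ind = dummies(e, ['B', 'M', 'P'])        # indicator coding: 3 columns
E_con = E_ind[:, 1:]                       # contrast coding: drop level B

# The indicator columns of X:E sum to X itself, so [X, X:E] is
# linearly dependent; with contrast coding it is not.
D_ind = np.column_stack([x, E_ind * x[:, None]])   # 4 columns
D_con = np.column_stack([x, E_con * x[:, None]])   # 3 columns

print(np.linalg.matrix_rank(D_ind))   # 3, not 4: linearly dependent
print(np.linalg.matrix_rank(D_con))   # 3: full column rank
```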
This is how the ANCOVA is constructed. For each numeric term, there is a corresponding pure categorical formula. For example
In [65]: z = Term('z')
In [66]: ancova7b = ANCOVA((1,e), (z,e), (z,(e,p)), (x*z,e), (x,e), (x,p), x*z)
In [67]: ancova7b.sequence(x)
Out[67]: [(x, (Factor('e', ['B', 'M', 'P']),)), (x, (Factor('p', ['L', 'M']),))]
In [68]: ancova7b.sequence(z*x)
Out[68]: [(x*z, ()), (x*z, (Factor('e', ['B', 'M', 'P']),))]
In [69]: ancova7b.sequence(1)
Out[69]: [(1, ()), (1, (Factor('e', ['B', 'M', 'P']),))]
In [70]: ancova7b.sequence(z)
Out[70]:
[(z, (Factor('e', ['B', 'M', 'P']),)),
(z, (Factor('e', ['B', 'M', 'P']), Factor('p', ['L', 'M'])))]
Any of the sequences above can be used to create new ANCOVA instances whose formula is the numeric expression multiplied by the corresponding purely categorical formula:
In [71]: ANCOVA(*ancova7b.sequence(z)).formula.terms
Out[71]: array([1, e_B*z, e_M*z, e_P*z, e_B*p_M*z, e_M*p_M*z, e_P*p_M*z], dtype=object)
In [72]: purely_categorical = ANCOVA(*[(1, factors) for _, factors in ancova7b.sequence(z)])
In [73]: purely_categorical.formula.terms
Out[73]: array([1, e_M, e_P, e_B*p_M, e_M*p_M, e_P*p_M], dtype=object)
In [74]: purely_categorical.formula.terms * z
Out[74]: array([z, e_M*z, e_P*z, e_B*p_M*z, e_M*p_M*z, e_P*p_M*z], dtype=object)
Each (expr, factor) pair in the ANCOVA specification maps to a specific contrast.
In [75]: ancova7.contrasts
Out[75]:
{'I(x)': Formula([x]),
'I(x):e': Formula([e_M*x, e_P*x]),
'I(x):e:p': Formula([e_M*p_M*x, e_P*p_M*x]),
'I(x):p': Formula([p_M*x])}
As opposed to:
In [76]: ancova3.contrasts
Out[76]:
{'1': Formula([1]),
'I(x):e:p': Formula([e_B*p_L*x, e_B*p_M*x, e_M*p_L*x, e_M*p_M*x, e_P*p_L*x, e_P*p_M*x])}
These are the default contrasts, which drop the first level of the factor. This can be changed with the default_contrast keyword argument:
In [77]: ancova8 = ANCOVA(x,(x,(p,e)),(x,e),(x,p), default_contrast='main_effect')
In [78]: ancova8.contrasts
Out[78]:
{'1': Formula([1]),
'I(x)': Formula([x]),
'I(x):e': Formula([x*(-e_B + e_M), x*(-e_B + e_P)]),
'I(x):e:p': Formula([x*(-e_B + e_M)*(-p_L + p_M), x*(-e_B + e_P)*(-p_L + p_M)]),
'I(x):p': Formula([x*(-p_L + p_M)])}
Each contrast can be associated with some columns of the final design matrix; these are also elements of the formula attribute:
In [79]: ancova3.slices
Out[79]: {'1': slice(0, 1, None), 'I(x):e:p': slice(1, 7, None)}
The slices can be interpreted as contrast matrices:
In [80]: ancova3.contrast_matrices
Out[80]:
{'1': array([[ 1., 0., 0., 0., 0., 0., 0.]]),
'I(x):e:p': array([[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 0., 1.]])}
Note, however, that these contrast matrices depend on the default_contrast argument. Generally speaking, they are appropriate for use when default_contrast is "main_effect" rather than "drop_reference". TODO: construct these properly for different default codings.
Further, not all of these contrasts are estimable; whether they are depends on the actual design matrix used to fit the OLS model, and users should keep this in mind. For example, the contrast I(X):E:P would not be estimable if we never observed a laborer with a PhD.
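For reference, a contrast matrix C with r rows is used to test the hypothesis C·beta = 0 via the statistic F = (Cb)'[C(X'X)^{-1}C']^{-1}(Cb) / (r·MSE). The sketch below applies this by hand to synthetic data with plain numpy; it illustrates the mechanics only and is not the package's own code:

```python
import numpy as np

rng = np.random.RandomState(42)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

# Ordinary least squares fit
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
mse = resid @ resid / (n - p)            # residual mean square

# A contrast matrix selecting the last two coefficients, analogous to
# the rows of the contrast_matrices shown above
C = np.array([[0., 0., 1., 0.],
              [0., 0., 0., 1.]])
r = C.shape[0]                           # numerator degrees of freedom

Cb = C @ b
middle = np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T)
F = (Cb @ middle @ Cb) / (r * mse)
print(F)   # small here, since the last two true coefficients are zero
```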
In [81]: ancova = ANCOVA((x,e),(x,p),(x,(p,e)))
In [82]: print ML.rec2txt(typeI('s', ancova, salary))
contrast SS df MS F p_value
1 13719944261.761 1.000 13719944261.761 2934.223 0.000
I(x):e 498268906.746 3.000 166089635.582 35.521 0.000
I(x):p 295195692.259 1.000 295195692.259 63.132 0.000
I(x):e:p 25275391.059 2.000 12637695.530 2.703 0.080
Residuals 182357587.174 39.000 4675835.569 nan nan
Compare this to the R output:
> anova(lm(S ~ X:E + X:P + X:P:E, data=data))
Analysis of Variance Table
Response: S
Df Sum Sq Mean Sq F value Pr(>F)
X:E 3 498268907 166089636 35.5208 3.077e-11 ***
X:P 1 295195692 295195692 63.1322 1.119e-09 ***
X:E:P 2 25275391 12637696 2.7028 0.07957 .
Residuals 39 182357587 4675836
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
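To make the Type I table less mysterious: each row's sum of squares is the drop in residual sum of squares when that contrast's columns are appended to the columns of all preceding contrasts. Below is a hand-rolled sketch of this sequential decomposition on synthetic blocks of columns (our own helper, not the typeI function itself):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 40
X1 = np.ones((n, 1))                     # the intercept contrast '1'
X2 = rng.standard_normal((n, 2))         # columns of a first contrast
X3 = rng.standard_normal((n, 1))         # columns of a second contrast
y = 3 + X2 @ np.array([1.0, -1.0]) + rng.standard_normal(n)

def rss(X, y):
    # Residual sum of squares of the least squares fit of y on X.
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((y - X @ b) ** 2))

blocks = [X1, X2, X3]
seq_ss = []
prev = float(np.sum(y ** 2))             # RSS of the empty model
for i in range(len(blocks)):
    cur = rss(np.column_stack(blocks[:i + 1]), y)
    seq_ss.append(prev - cur)            # drop in RSS from adding this block
    prev = cur

for i, ss in enumerate(seq_ss):
    print('block %d: sequential SS = %.3f, df = %d' % (i + 1, ss, blocks[i].shape[1]))
```

Because each block is added to everything before it, reordering the blocks changes the sums of squares, which is exactly why the term order matters in the Type I tables above.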
For type II:
In [83]: print ML.rec2txt(typeII('s', ancova, salary))
contrast SS df MS F p_value
1 2160721082.840 1.000 2160721082.840 462.104 0.000
I(x):e 96080370.831 3.000 32026790.277 6.849 0.001
I(x):p 295195692.259 1.000 295195692.259 63.132 0.000
I(x):e:p 25275391.059 2.000 12637695.530 2.703 0.080
Residuals 182357587.174 39.000 4675835.569 nan nan
> library(car)
> Anova(lm(S ~ X:E + X:P + X:P:E, data=data), type='II')
Anova Table (Type II tests)
Response: S
Sum Sq Df F value Pr(>F)
X:E 96080371 3 6.8494 0.0008116 ***
X:P 295195692 1 63.1322 1.119e-09 ***
X:E:P 25275391 2 2.7028 0.0795676 .
Residuals 182357587 39
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And type III:
In [84]: print ML.rec2txt(typeIII('s', ancova, salary))
contrast SS df MS F p_value
1 2160721082.840 1.000 2160721082.840 462.104 0.000
I(x):e 70707719.836 3.000 23569239.945 5.041 0.005
I(x):p 11631680.248 1.000 11631680.248 2.488 0.123
I(x):e:p 25275391.059 2.000 12637695.530 2.703 0.080
Residuals 182357587.174 39.000 4675835.569 nan nan
> library(car)
> Anova(lm(S ~ X:E + X:P + X:P:E, data=data), type='III')
Anova Table (Type III tests)
Response: S
Sum Sq Df F value Pr(>F)
(Intercept) 2160721083 1 462.1037 < 2.2e-16 ***
X:E 70707720 3 5.0406 0.004775 **
X:P 11631680 1 2.4876 0.122823
X:E:P 25275391 2 2.7028 0.079568 .
Residuals 182357587 39
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Reversing the order changes the ANOVA tables, in particular the degrees of freedom associated with each contrast. This is because the codings change when the order of the factors changes.
In [85]: ancova2 = ANCOVA((x,p),(x,e), (x,(p,e)))
In [86]: print ML.rec2txt(typeII('s', ancova2, salary))
contrast SS df MS F p_value
1 2160721082.840 1.000 2160721082.840 462.104 0.000
I(x):p 344134184.851 2.000 172067092.425 36.799 0.000
I(x):e 17420924.250 2.000 8710462.125 1.863 0.169
I(x):e:p 25275391.059 2.000 12637695.530 2.703 0.080
Residuals 182357587.174 39.000 4675835.569 nan nan
> library(car)
> Anova(lm(S ~ X:P + X:E + X:P:E, data=data), type='II')
Anova Table (Type II tests)
Response: S
Sum Sq Df F value Pr(>F)
X:P 344134185 2 36.7992 1.049e-09 ***
X:E 17420924 2 1.8629 0.16878
X:P:E 25275391 2 2.7028 0.07957 .
Residuals 182357587 39
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In [87]: print ML.rec2txt(typeIII('s', ancova2, salary))
contrast SS df MS F p_value
1 2160721082.840 1.000 2160721082.840 462.104 0.000
I(x):p 41682012.745 2.000 20841006.372 4.457 0.018
I(x):e 9511377.475 2.000 4755688.737 1.017 0.371
I(x):e:p 25275391.059 2.000 12637695.530 2.703 0.080
Residuals 182357587.174 39.000 4675835.569 nan nan
> library(car)
> Anova(lm(S ~ X:P + X:E + X:P:E, data=data), type='III')
Anova Table (Type III tests)
Response: S
Sum Sq Df F value Pr(>F)
(Intercept) 2160721083 1 462.1037 < 2e-16 ***
X:P 41682013 2 4.4572 0.01806 *
X:E 9511377 2 1.0171 0.37104
X:P:E 25275391 2 2.7028 0.07957 .
Residuals 182357587 39
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1