I am going to try to formalize a version of R's formula,
distinguishing between numeric and factor-like
variables; the difference is that, for a
numeric variable :math:`x`, :math:`x \cdot x \neq x`, unlike
in R. It also will not have a ``-`` operation.
A factor is a categorical variable, and the usual interpretation of

> a=factor(levels=c(2,3,4))
> b=factor(levels=c(3,4))
> ~a*b
~a * b

is as the Cartesian product of the columns of indicator matrices for ``a`` and ``b``. But in R the formula has meaning even without knowing the levels of ``a`` and ``b``. The expression ``a*b`` above has a multiplication in it. R uses two different multiplications; I will only use one, and it corresponds to R's symbol ``:``.
Formally, let :math:`{\cal F}` be a set of factors.
Multiplication automatically suggests some algebraic rules
for formulae. We usually think of
factors as obeying the rules

.. math::

   f \cdot f = f, \qquad (f \cdot g) \cdot h = f \cdot (g \cdot h), \qquad
   f \cdot g = g \cdot f, \qquad 1 \cdot f = f \cdot 1 = f.

The rules correspond to: idempotency, associativity and commutativity,
together with an identity element :math:`1`.
If we added an inverse to multiplication, then the set of factors
would constitute a group.
However,
there is no inverse to multiplication. Algebraically,
this means that the set of factors,
combined with this operation and the properties above, generates an
algebraic object known as a monoid.
Note, the :math:`1` above is not the constant 1, but the identity
in the monoid :math:`M`
generated by :math:`{\cal F}`
and the
product ":math:`\cdot`".
Algebraically, this idempotent, commutative monoid
:math:`M` is
isomorphic to the join-semilattice
of subsets of :math:`{\cal F}`,
with join representing union.
In this view,
:math:`1` corresponds to the empty set
:math:`\emptyset`.
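To make this concrete, here is a minimal sketch of the monoid in Python (hypothetical code, not nipy's API), representing a monomial as a frozenset of factor names with union as the product:

# Monomials are frozensets of factor names; the product is set union
# and the identity is the empty set. Idempotency, commutativity and
# associativity are inherited from the corresponding properties of union.
def product(m1, m2):
    return m1 | m2

a, b = frozenset(['a']), frozenset(['b'])
one = frozenset()                        # the identity, i.e. the empty set

assert product(a, a) == a                # idempotency
assert product(a, b) == product(b, a)    # commutativity
assert product(a, one) == a              # identity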
In R's formulae there is also an addition operation, ``+``. A monoid does not have addition; it has only one operation, multiplication. In order to introduce addition, we need to introduce something like an algebra. If we had a group, the natural structure to consider would be a group algebra or group ring. The corresponding notion for a monoid is that of a monoid ring, or a monoid algebra if the ring used in constructing the monoid ring is commutative. The monoid algebra is constructed from a monoid, which we already have, and a commutative ring.
For the classical ANOVA models (no numeric variables),
we can take this ring to be :math:`\mathbb{Z}`. Below, when
we introduce things like numeric vectors in R
to our formula, we will
change the ring to be the ring of real-valued functions
on some space.
The monoid algebra of our monoid :math:`M`
over :math:`\mathbb{Z}`
can be expressed in terms of sets of 2-tuples
:math:`(c, m)` with :math:`c \in \mathbb{Z}, m \in M`, i.e. formal sums
:math:`\sum_i c_i m_i`,
with multiplication, addition and subtraction defined
as:

.. math::

   \left(\sum_i c_i m_i\right) \cdot \left(\sum_j d_j n_j\right)
   = \sum_{i,j} (c_i d_j) (m_i \cdot n_j), \qquad
   \sum_i c_i m_i \pm \sum_i d_i m_i = \sum_i (c_i \pm d_i) m_i.
If we introduce variables, then these rules can
be written as

.. math::

   (c_1, m_1) \cdot (c_2, m_2) = (c_1 c_2, m_1 \cdot m_2), \qquad
   (c_1, m) \pm (c_2, m) = (c_1 \pm c_2, m).

As our monoid is idempotent and commutative, we arrive at the conclusion that

.. math::

   a \cdot a = a, \qquad a \in {\cal F}.
This places R's convention

> a*a
logical(0)
Warning message:
In Ops.factor(a, a) : * not meaningful for factors

into a meaningful algebraic setting.
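The sketch below (hypothetical, not nipy's API) extends the frozenset representation to the monoid algebra over :math:`\mathbb{Z}`, storing an element as a dict from monomials to nonzero integer coefficients; it shows that :math:`a \cdot a = a` holds while :math:`a + a = 2a` does not collapse to :math:`a`:

# An element of Z[M] is a dict mapping monomials (frozensets of factor
# names) to nonzero integer coefficients.
def add(x, y):
    out = dict(x)
    for m, c in y.items():
        out[m] = out.get(m, 0) + c
    return {m: c for m, c in out.items() if c != 0}

def mul(x, y):
    out = {}
    for m1, c1 in x.items():
        for m2, c2 in y.items():
            m = m1 | m2                   # the monoid product: union
            out[m] = out.get(m, 0) + c1 * c2
    return {m: c for m, c in out.items() if c != 0}

a = {frozenset(['a']): 1}
assert mul(a, a) == a                      # a * a = a
assert add(a, a) == {frozenset(['a']): 2}  # but a + a = 2a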
Actually, R does this for numeric variables as well, unless one uses the ``I`` notation:
> terms(formula(~x^2))
~x^2
attr(,"variables")
list(x)
attr(,"factors")
x
x 1
attr(,"term.labels")
[1] "x"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0
attr(,".Environment")
<environment: R_GlobalEnv>
> terms(formula(~I(x^2)))
~I(x^2)
attr(,"variables")
list(I(x^2))
attr(,"factors")
I(x^2)
I(x^2) 1
attr(,"term.labels")
[1] "I(x^2)"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0
attr(,".Environment")
<environment: R_GlobalEnv>
Unfortunately, we cannot make sense
of R's convention that :math:`a + a = a` in the monoid algebra.
There are a few things wrong with the expression :math:`a + a = a`
if we take :math:`a`
to be an element of the monoid
:math:`M`.
The most important thing is to note that
:math:`a`, as an element of :math:`M`, is not an element
of the algebra
:math:`\mathbb{Z}[M]`, which is the structure where we have
defined addition.
However, we can define an equivalence relation :math:`\sim`
on :math:`\mathbb{Z}[M]`
as follows:

.. math::

   \sum_i c_i m_i \sim \sum_j d_j n_j \iff
   \{m_i : c_i \neq 0\} = \{n_j : d_j \neq 0\}.

That is, two elements are equivalent when the same monomials appear with
nonzero coefficients. Let the map to equivalence classes be denoted by
:math:`\pi`. Then,
R's convention
is equivalent to

.. math::

   \pi(a + a) = \pi(a)

or

.. math::

   \pi(2a) = \pi(a).
The set of equivalence classes has a well-defined notion
of addition. Given two equivalence classes :math:`\pi(x), \pi(y)`,
we define

.. math::

   \pi(x) + \pi(y) = \pi(\tilde{x} + \tilde{y})

where :math:`\tilde{x}` is a generic element of the equivalence
class :math:`\pi(x)`. Since our algebra is an algebra over
:math:`\mathbb{Z}`,
we can take the generic elements to have random integer coefficients, say in the range
[-1e6,1e6]. This means that this notion of the
sum of equivalence classes will break
down occasionally, in the sense
that, with probability roughly 4e-12, two such random coefficients cancel and

.. math::

   \pi(\tilde{x} + \tilde{y}) \neq \pi(x + y).
In the implementation of this in nipy, the ring we take can be thought of
as the ring of sympy dummy symbols, so, up to memory errors,

.. math::

   \pi(\tilde{x} + \tilde{y}) = \pi(x + y)

always.
The set of equivalence classes also has a well-defined
notion of product:

.. math::

   \pi(x) \cdot \pi(y) = \pi(\tilde{x} \cdot \tilde{y}).

This, too, may occasionally break down depending on how
we define our generic element :math:`\tilde{x}`. If we take
the ring to be :math:`\mathbb{R}`
and generate a generic element
by drawing its coefficients from a random variable with a continuous density, like the
standard normal :math:`N(0,1)`,
then, with probability 1, the rules
of addition and multiplication will not break down if we perform
countably many such operations.
It is not hard to see that the multiplication and addition
are associative, commutative and distributive on :math:`\pi(\mathbb{Z}[M])`.
While addition and multiplication are well-defined on the set of equivalence classes, there is no corresponding notion of subtraction of equivalence classes because, generically,

.. math::

   \pi(\tilde{x} - \tilde{y}) = \pi(\tilde{x} + \tilde{y}).
Therefore, we cannot call the set of equivalence classes a ring
or any other familiar object. As our set of equivalence classes
is just the set of subsets of :math:`M`, we can delete
the terms from :math:`\pi(x)`
that appear in :math:`\pi(y)`
by the elementwise multiplication

.. math::

   \sum_i c_i m_i \mapsto \sum_i c_i \, 1_{[m_i \notin \pi(y)]} \, m_i.

Or, regarding :math:`\pi(x), \pi(y)` as subsets of
:math:`M`, deletion
of terms is just the set difference
:math:`\pi(x) \setminus \pi(y)`.
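In the set-of-subsets view, the operations on equivalence classes become ordinary set manipulations, continuing the earlier sketch (hypothetical code, not nipy's API):

# Equivalence classes are sets of monomials (frozensets of factor names).
def class_add(X, Y):
    return X | Y             # generically, pi(x + y) keeps every monomial

def class_mul(X, Y):
    return {m1 | m2 for m1 in X for m2 in Y}   # pairwise monoid products

def class_delete(X, Y):
    return X - Y             # R's '-': deletion of terms is set difference

a, b = frozenset(['a']), frozenset(['b'])
assert class_add({a, b}, {a}) == {a, b}        # a + a = a, recovered
assert class_mul({a}, {b}) == {frozenset(['a', 'b'])}
assert class_delete({a, b}, {b}) == {a}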
We have now given a formal way of describing which sets of
factors are to be included in a regression model.
The set of factors to be included in a regression model
is specified by an element of :math:`\pi(\mathbb{Z}[M])`, where
the monoid :math:`M`
is the semilattice of subsets of :math:`{\cal F}`.
On the set of factor specifications
:math:`\pi(\mathbb{Z}[M])`, we have
defined two operations that obey the commutative, distributive
and associative properties.
Above, we described rules for specifying which factors are to be included in the model. One can take the same approach for numeric variables, which is essentially what R does. However, R ultimately treats numeric variables and factors differently when it comes to building the design matrix. For instance,
in the first model below, subtracting 1 yields residuals orthogonal to the constant, while this is not true in the second model.
> x = rnorm(40)
> y = rnorm(40)
> f = factor(rep(c(1,2),20))
> r1 = resid(lm(y~f-1))
> print(sum(r1))
[1] 1.124101e-15
> r2 = resid(lm(y~x-1))
> print(sum(r2))
[1] 2.616077
There is a perfectly good explanation for this if one uses
R's rules for constructing a design matrix
from an element of :math:`\pi(\mathbb{Z}[M])`, as we will see below.
However, it does mean that factors and numerics are treated separately. One way to address this is to consider the monoid

.. math::

   M = M_{\cal F} \times M_{\cal N}

where :math:`M_{\cal F}` is a monoid
for factors that is essentially the same monoid we
saw in the previous section
and :math:`M_{\cal N}`
is a monoid for numerical variables. The multiplication in this monoid is defined componentwise:

.. math::

   (f_1, n_1) \cdot (f_2, n_2) = (f_1 \cdot f_2, n_1 \cdot n_2).

The only remaining issue is how we define multiplication in the monoid
:math:`M_{\cal N}`. R uses the same idempotency rules as for factors,
though it seems more natural to use true multiplication instead.
> lm(y~x*x)$coef
(Intercept) x
0.0775913698 0.0003296236
> lm(y~x)$coef
(Intercept) x
0.0775913698 0.0003296236
This leaves room for some weird manipulations:
> lm(y~x*log(exp(x)))
Call:
lm(formula = y ~ x * log(exp(x)))
Coefficients:
(Intercept) x log(exp(x)) x:log(exp(x))
0.06966 0.01130 NA 0.01779
> lm(y~x + I(x^2))
Call:
lm(formula = y ~ x + I(x^2))
Coefficients:
(Intercept) x I(x^2)
0.06966 0.01130 0.01779
Using a symbolic system like sympy can be helpful in cases like this.
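For example, sympy can recognize that ``log(exp(x))`` is just ``x`` (for a real-valued symbol), something a purely syntactic treatment of variable names cannot:

import sympy

# With a real symbol, sympy simplifies log(exp(x)) to x automatically,
# so the redundancy in ~ x * log(exp(x)) could be detected symbolically.
x = sympy.Symbol('x', real=True)
print(sympy.log(sympy.exp(x)))       # x
print(sympy.log(sympy.exp(x)) == x)  # True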
Before continuing,
let us make a slight modification of the previous monoid
that recognizes the fact that these objects
are meant to represent random variables, that is, functions on some
sample space :math:`\Omega`. The process of constructing the
design matrix is then just the evaluation of these functions
at some point
:math:`\omega \in \Omega`. If we want to think
of a more concrete probability model, we can think
of a data.frame with :math:`n`
rows as a point in :math:`\Omega^n`,
and our probability distribution might correspond to IID samples
from some probability measure :math:`\mathbb{P}` on :math:`\Omega`.
Of course, constructing the design matrix is completely
separate from specifying the joint distribution on
:math:`\Omega`, but
thinking of factors and numeric variables as functions
makes things more concrete.
In this new view, a factor :math:`f` from the previous section is now defined as a
function :math:`f: \Omega \rightarrow L(f)`, where :math:`L(f)`
is the set of
levels of the factor. We define a numeric variable as a real-valued function
on :math:`\Omega`.
With this definition, :math:`{\cal F}` is a (finite) collection of functions
on :math:`\Omega`, each taking values in a different space, and :math:`M_{\cal F}`
is the same monoid as before. Therefore,
a member of the set of equivalence classes :math:`\pi(\mathbb{Z}[M_{\cal F}])`
can be viewed as a set of subsets of :math:`{\cal F}`, i.e. a set of sets of
functions on :math:`\Omega`.
We might then take :math:`M_{\cal N}` to be the monoid
of real-valued functions on :math:`\Omega`
with multiplication
defined in the obvious way:

.. math::

   (n_1 \cdot n_2)(\omega) = n_1(\omega) \, n_2(\omega).
There is, however, a possible measure-theoretic
complication to consider, as we should properly consider
equivalence classes on :math:`M_{\cal N}` defined
by equality up to sets of probability 0 according to some
probability space
:math:`(\Omega, {\cal B}, \mathbb{P})`.
Let us therefore fix this probability space
and take :math:`M_{\cal N}` to be
the collection of random variables on
:math:`(\Omega, {\cal B}, \mathbb{P})`.
Carrying out the same construction of equivalence classes
as for factors, our model is represented as a member of
:math:`\pi(\mathbb{Z}[M_{\cal F} \times M_{\cal N}])`.
Viewing this as a subset, our model is

.. math::

   \{(F_1, n_1), \dots, (F_k, n_k)\}, \qquad F_i \subseteq {\cal F}, \; n_i \in M_{\cal N}.

Hence, our model is a collection of tuples of real-valued random variables
:math:`n_i` combined with subsets of the factors :math:`{\cal F}`, each
of which are also functions defined on :math:`\Omega`.
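Continuing the earlier sketches, a member of :math:`\pi(\mathbb{Z}[M_{\cal F} \times M_{\cal N}])` might be represented as a set of (factor subset, numeric variable) pairs, with sympy expressions standing in for the random variables (a hypothetical representation, not nipy's):

import sympy

x = sympy.Symbol('x', real=True)
one = sympy.Integer(1)

# Something like R's ~ 1 + x + f + x:f, with the empty set and the
# constant 1 standing for "no factors" and "no numeric variable".
model = {
    (frozenset(), one),        # intercept
    (frozenset(), x),          # x
    (frozenset(['f']), one),   # f
    (frozenset(['f']), x),     # x:f
}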
Given a description of our model, i.e. a set of tuples :math:`(F_i, n_i)`,
we now must build a design matrix.
Formally, this corresponds to constructing a function
:math:`D` from :math:`\Omega^n` to :math:`\mathbb{R}^{n \times p}`,
and then evaluating this function at a
point :math:`\omega \in \Omega^n`,
which yields a matrix :math:`D(\omega)`.
Part of the construction is also to determine
a well-defined set of names for each column of :math:`D`.
Given a linear, or total, ordering on the random variables
we see in the model, i.e. on the tuples :math:`(F_i, n_i)`,
the process of building this design matrix can
be carried out in stages based on the functions :math:`D_{F_{(i)}}` below.
Let the ordering be determined as

.. math::

   (F_{(1)}, n_{(1)}) < (F_{(2)}, n_{(2)}) < \dots < (F_{(k)}, n_{(k)}).

For each :math:`(F_{(i)}, n_{(i)})`, we construct a function
:math:`D_{F_{(i)}}` coding the factors in :math:`F_{(i)}`,
and the final function is determined by
concatenation in the order previously determined:

.. math::

   D = \left[ n_{(1)} \cdot D_{F_{(1)}}, \dots, n_{(k)} \cdot D_{F_{(k)}} \right]

where the multiplication above is elementwise (each numeric variable
multiplies every column produced by its factor coding).
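A minimal numpy sketch of this concatenation step (hypothetical helper, not nipy's API; the factor-coding matrices are assumed given):

import numpy as np

def build_design(blocks):
    # blocks: (numeric_column, factor_coding_matrix) pairs, already in
    # the chosen linear order; each numeric column multiplies every
    # column of its factor coding elementwise, and the resulting blocks
    # are concatenated.
    return np.hstack([n[:, None] * D for n, D in blocks])

nobs = 5
x = np.random.standard_normal(nobs)
ones = np.ones(nobs)
# An intercept plus a slope for x, each with the trivial factor coding.
design = build_design([(ones, np.ones((nobs, 1))),
                       (x, np.ones((nobs, 1)))])
print(design.shape)   # (5, 2)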
In R, this linear ordering seems to be based on the name of the variable in the expression, with 1, the intercept, coming before all others:
> z = rnorm(40)
> lm(y~x+z)$coef
(Intercept) x z
0.0429108507 -0.0007459873 -0.1763908600
> b = x
> a = z
> lm(y~a+b)$coef
(Intercept) a b
0.0429108507 -0.1763908600 -0.0007459873
> lm(y~(x+z)*f)$coef
(Intercept) x z f2 x:f2 z:f2
0.2353311 0.5266701 -0.7416684 -0.2650608 -0.4068248 0.9817334
> lm(y~(a+b)*f)$coef
(Intercept) a b f2 a:f2 b:f2
0.2353311 -0.7416684 0.5266701 -0.2650608 0.9817334 -0.4068248
To construct the coding of the factors, R uses rules as described in Hastie & Chambers (1992), meant to satisfy users' expectations in common statistical settings. For instance, in the case of two factors with one nested in the other, this rule produces the "expected result".
In our notation, constructing this matrix is equivalent to constructing
a function :math:`D_F: \Omega \rightarrow \mathbb{R}^p`. Here, :math:`p`
is the number of real-valued functions, i.e. random variables, specified by
the element of :math:`\pi(\mathbb{Z}[M_{\cal F}])`, which
may be viewed as a collection of subsets of
:math:`{\cal F}`.
The collection of subsets of :math:`{\cal F}` is naturally graded, i.e. it
can be partially ordered in terms of the size of each subset. Let

.. math::

   F_{(1)} \leq F_{(2)} \leq \dots \leq F_{(k)}

represent this
ordering, where each :math:`F_{(i)} \subseteq {\cal F}`
and,
possibly,
:math:`F_{(1)} = \emptyset`. Now, each :math:`F_{(i)}`
will contribute a
certain number of random variables to :math:`D_F`
that are concatenated
together. The number of random variables depends on how each factor
is coded. The coding of a factor :math:`f`
is defined as a choice
of a random vector :math:`X_f: \Omega \rightarrow \mathbb{R}^{d_f}`. The types
of coding fall into two groups: either indicator or contrast. If a factor is
coded with indicator variables, then :math:`d_f = |L(f)|`, the number of
levels of :math:`f`. If a factor is coded with contrast variables, then
:math:`d_f = |L(f)| - 1`.
For each factor :math:`f`, and a linear ordering of its levels
:math:`L(f) = \{l_1, \dots, l_{|L(f)|}\}`, we can define :math:`|L(f)|` binary random
variables

.. math::

   1_{[f = l_j]}(\omega) =
   \begin{cases}
   1 & f(\omega) = l_j \\
   0 & \text{otherwise,}
   \end{cases}
   \qquad 1 \leq j \leq |L(f)|.

An indicator coding is any set of functions whose linear span
coincides with the linear span of the :math:`1_{[f = l_j]}`. Usually,
these functions are just taken to be the
:math:`1_{[f = l_j]}`'s themselves.
A contrast coding is a set of :math:`|L(f)| - 1` linearly independent linear
combinations of the :math:`1_{[f = l_j]}`,
chosen in various ways, each with different
interpretations.
> t(contr.helmert(4))
1 2 3 4
[1,] -1 1 0 0
[2,] -1 -1 2 0
[3,] -1 -1 -1 3
> t(contr.poly(4))
[,1] [,2] [,3] [,4]
.L -0.6708204 -0.2236068 0.2236068 0.6708204
.Q 0.5000000 -0.5000000 -0.5000000 0.5000000
.C -0.2236068 0.6708204 -0.6708204 0.2236068
> t(contr.sum(4))
1 2 3 4
[1,] 1 0 0 -1
[2,] 0 1 0 -1
[3,] 0 0 1 -1
> t(contr.treatment(4))
1 2 3 4
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> t(contr.SAS(4))
1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
For instance, ``contr.sum`` (which corresponds to nipy's ``main_effect``) uses the random variables

.. math::

   1_{[f = l_j]} - 1_{[f = l_{|L(f)|}]}, \qquad 1 \leq j \leq |L(f)| - 1.
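The two kinds of coding are easy to write down concretely; here is a small numpy sketch (hypothetical helpers, written to mirror R's ``contr.sum`` output above):

import numpy as np

def indicator_coding(values, levels):
    # One 0/1 column per level.
    values = np.asarray(values)
    return np.column_stack([(values == l).astype(float) for l in levels])

def sum_contrast_coding(values, levels):
    # len(levels) - 1 columns: indicator of level j minus indicator of
    # the last level, as in R's contr.sum.
    ind = indicator_coding(values, levels)
    return ind[:, :-1] - ind[:, -1:]

f = ['a', 'b', 'c', 'a']
print(indicator_coding(f, ['a', 'b', 'c']))
print(sum_contrast_coding(f, ['a', 'b', 'c']))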
The rule R uses to construct the codings for
a given element of :math:`\pi(\mathbb{Z}[M_{\cal F}])` produces a set of random
variables whose linear span is equivalent to the linear span of
the indicator codings of all of the subsets :math:`F_{(i)}`.
The way it produces these columns depends on a linear
ordering of :math:`{\cal F}`. Let

.. math::

   f_1 < f_2 < \dots < f_q

be this linear ordering. This, in turn, determines a linear ordering
on all subsets of
:math:`{\cal F}`. Note that this linear ordering
corresponds to
representing sets as sorted tuples, so that the model
is a sorted list of sorted tuples.
We therefore now assume that :math:`\{F_{(1)}, \dots, F_{(k)}\}` has been linearly ordered
as

.. math::

   F_{(1)} < F_{(2)} < \dots < F_{(k)}.
This next step is a bit of a mouthful, but here goes:
for any :math:`F \subseteq {\cal F}`, the linear span of the random variables

.. math::

   \prod_{f \in F} 1_{[f = l_f]}, \qquad l_f \in L(f)

is equivalent to the linear span obtained by using the contrast coding
for each factor in each subset :math:`S \subseteq F` and concatenating the results.
In words, the linear space spanned by using the indicator coding
for each factor in :math:`F` is equivalent to the column
space spanned by using the contrast coding
for each :math:`S \subseteq F`
and concatenating
them all together.
Hence, the column space spanned by using the indicator coding
for each factor in both of :math:`F` and :math:`G` is equivalent
to using the contrast coding for each factor in
every subset of :math:`F`
or :math:`G`. The collection of such subsets forms an abstract
simplicial complex with maximal simplices :math:`F` and :math:`G`.
R effectively constructs the function :math:`D_F` sequentially,
based on the stages :math:`F_{(1)}, \dots, F_{(k)}`,
in such a way that, for each :math:`i`,
the linear span of the coordinate
functions of the first :math:`i` stages
is the same as the linear span of
using all indicator codings for each factor in each of the subsets
:math:`F_{(1)}, \dots, F_{(i)}`.
By construction, then, the coordinate functions of the final function span the same space as if it had used the indicator
codings for each factor in each of the
:math:`F_{(i)}`.
The rule can be expressed in terms of operations on simplicial complexes:
def factor_codings(*factor_monomials):
    """ Find which factors to code with indicator or contrast variables

    Determine which factors to code with indicator variables (using
    len(factor.levels) columns of 0s and 1s) or contrast coding (using
    len(factor.levels)-1). The elements of the sequence should be tuples of
    strings. Further, the factors are assumed to be in *graded* order, that is
    [len(f) for f in factor_monomials] is assumed non-decreasing.

    Examples
    --------
    >>> factor_codings(('b',), ('a',), ('b', 'c'), ('a','b','c'))
    {('b', 'c'): [('b', 'indicator'), ('c', 'contrast')], ('a',): [('a', 'contrast')], ('b',): [('b', 'indicator')], ('a', 'b', 'c'): [('a', 'contrast'), ('b', 'indicator'), ('c', 'indicator')]}
    >>> factor_codings(('a',), ('b',), ('b', 'c'), ('a','b','c'))
    {('b', 'c'): [('b', 'indicator'), ('c', 'contrast')], ('a',): [('a', 'indicator')], ('b',): [('b', 'contrast')], ('a', 'b', 'c'): [('a', 'contrast'), ('b', 'indicator'), ('c', 'indicator')]}

    Here is a version with debug strings to see what is happening:

    >>> factor_codings(('a',), ('b', 'c'), ('a','b','c')) #doctest: +SKIP
    Adding a from ('a',) as indicator because we have not seen any factors yet.
    Adding b from ('b', 'c') as indicator because set([('c',), ()]) is not a subset of set([(), ('a',)])
    Adding c from ('b', 'c') as indicator because set([(), ('b',)]) is not a subset of set([(), ('a',)])
    Adding a from ('a', 'b', 'c') as contrast because set([('c',), ('b', 'c'), (), ('b',)]) is a subset of set([('b', 'c'), (), ('c',), ('b',), ('a',)])
    Adding b from ('a', 'b', 'c') as indicator because set([('c',), (), ('a', 'c'), ('a',)]) is not a subset of set([('b', 'c'), (), ('c',), ('b',), ('a',)])
    Adding c from ('a', 'b', 'c') as indicator because set([('a', 'b'), (), ('b',), ('a',)]) is not a subset of set([('b', 'c'), (), ('c',), ('b',), ('a',)])
    {('b', 'c'): [('b', 'indicator'), ('c', 'indicator')], ('a',): [('a', 'indicator')], ('a', 'b', 'c'): [('a', 'contrast'), ('b', 'indicator'), ('c', 'indicator')]}

    Notes
    -----
    Even though the elements of factor_monomials are assumed to be in graded
    order, the final result depends on the ordering of the strings of the
    factors within each of the tuples.
    """
    lmax = 0
    from copy import copy
    already_seen = set([])
    final_result = []
    for factor_monomial in factor_monomials:
        result = []
        factor_monomial = list(factor_monomial)
        if len(factor_monomial) < lmax:
            raise ValueError('factors are assumed to be in graded order')
        lmax = len(factor_monomial)
        for j in range(len(factor_monomial)):
            cur = copy(list(factor_monomial))
            cur.pop(j)
            terms = simplicial_complex(cur)[2]
            if already_seen and set(terms).issubset(already_seen):
                result.append((factor_monomial[j], 'contrast'))
            else:
                result.append((factor_monomial[j], 'indicator'))
        already_seen = already_seen.union(simplicial_complex(factor_monomial)[2])
        final_result.append((tuple(factor_monomial), result))
    return dict(final_result)
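The helper ``simplicial_complex`` above comes from nipy's statistics utilities. As a rough, hypothetical stand-in (its interface here is inferred only from how ``factor_codings`` uses it, namely that element ``[2]`` of the result is the set of all faces of the given simplex):

from itertools import combinations

def simplicial_complex(simplex):
    # Return a 3-tuple whose last element is the set of all faces of
    # `simplex`, represented as sorted tuples; the first two slots
    # (present in nipy's version) are stubbed out in this sketch.
    items = sorted(simplex)
    faces = set()
    for r in range(len(items) + 1):
        faces.update(combinations(items, r))
    return None, None, faces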
These types of expressions are used in R to construct design matrices:

> c = factor(levels=c('warm', 'hot'))
> a + a:b + b:c + a:b:c
Error in b:c : NA/NaN argument
In addition: Warning messages:
1: In a:b : numerical expression has 40 elements: only the first used
2: In a:b : numerical expression has 40 elements: only the first used
3: In b:c : numerical expression has 40 elements: only the first used

However, R does not meaningfully handle this product. In order to construct a design matrix, R tries to come up with a smart parameterization of all the columns represented in the formula above. Note also that I overwrote c, one of R's most fundamental objects, oops.
Here is an example where the product is made sense of and used in a linear model.
> a = factor(rep(c(1,1,1,1,2,2,2,2,3,3,3,3), 10))
> b = factor(rep(c(1,1,2,2,1,1,2,2,1,1,2,2),10))
> cc = factor(rep(c('hot','warm','hot','warm','hot','warm','hot','warm','hot','warm','hot','warm'),10))
> Y = rnorm(120)
> lm1 = lm(Y~a+b+cc+a:b:cc-1)
> print(lm1$coef)
a1 a2 a3 b2 ccwarm a1:b1:cchot
-2.2061381 -2.2196141 -1.2529329 1.0173781 1.0550321 2.1112981
a2:b1:cchot a3:b1:cchot a1:b2:cchot a2:b2:cchot a3:b2:cchot a1:b1:ccwarm
1.5869000 1.3813130 0.8630796 1.2249425 NA 0.8034864
a2:b1:ccwarm a3:b1:ccwarm a1:b2:ccwarm a2:b2:ccwarm a3:b2:ccwarm
1.1049333 NA NA NA NA
At this point, the algebraic tools being used are slightly different. The
monoid structure is forgotten and we turn to working with simple polynomials.
What R does is try to find an alternative, but equivalent, expression for
something like :math:`a \cdot b \cdot cc` because it recognizes
:math:`a \cdot b \cdot cc` as maximal in the formula above even though
:math:`a` as well as :math:`b`
and :math:`cc`
also appear. In this case, equivalence
means equivalence in the additive sense. That is, it constructs a second element
of this monoid ring such that the column span, when computed, is identical to
that of :math:`a \cdot b \cdot cc`, which is the pairwise product of the dummy
indicator variables :math:`1_{[a = i]} \cdot 1_{[b = j]} \cdot 1_{[cc = k]}`. The polynomials that R uses for this are
generated by terms of the form :math:`f` and :math:`(f - 1)`. An expression of
the form :math:`(f - 1)` indicates that R will use "contrasts" instead of
"dummy variables" to encode this factor in the expression. There are several
choices of contrast to use.
If the contrast encoding involves dropping a term, say the term corresponding to the last level in a factor, then the resulting expression
has columns such as

[a_2 * b_3 * c_hot, a_2 * b_4 * c_hot]

Note that this assumes an ordering of the levels of a factor. This is not a problem, but it makes some difference in terms of which columns of the design matrix are produced by R. For instance, if we changed "hot" to "S" for "sweltering" and "warm" to "C" for "comfy" in the levels of c, then the design matrix would have columns

[a_2 * b_3 * c_C, a_2 * b_4 * c_C]

Therefore, the columns that R generates depend on the order of the levels of the factors and, as we will see, they also depend on the order of the names of the factors.
> dd = factor(rep(c('S','C','S','C','S','C','S','C','S','C','S','C'),10))
> lm2 = lm(Y~a+b+dd+a:b:dd-1)
> print(lm2$coef)
a1 a2 a3 b2 ddS a1:b1:ddC a2:b1:ddC
1.0932867 1.4416736 1.1834122 -0.3639349 -1.0550321 -1.4409063 -1.5013222
a3:b1:ddC a1:b2:ddC a2:b2:ddC a3:b2:ddC a1:b1:ddS a2:b1:ddS a3:b1:ddS
-1.3813130 -0.8630796 -1.2249425 NA -0.1330946 -1.0193555 NA
a1:b2:ddS a2:b2:ddS a3:b2:ddS
NA NA NA
> print(sum((fitted(lm1)-fitted(lm2))^2))
[1] 2.20623e-29
The point of these code snippets is to emphasize that the columns created by R depend on a total linear ordering of monomials, whose names depend both on the factor names and the factor levels. Algebraically, the monomial
:math:`a \cdot b \cdot c` knows nothing about the levels of the factors, so it
does not know how to construct this design matrix. However,
we can create a linear ordering of the
monomials given a linear ordering of :math:`{\cal F}`.
The algorithm R uses to construct the columns of the design matrix is
relatively straightforward, at least as described in Hastie & Chambers (1992),
once the linear ordering of the monomials in the monoid ring is fixed, as well
as the ordering of the levels of the factors. As the monoid ring is
graded, there is a natural
ordering of monomials in terms of the cardinality of the subsets they represent
(remember our monoid is algebraically isomorphic to the semi-lattice of
subsets of :math:`{\cal F}` with binary operation
:math:`\cup`). This is just to
emphasize that, to implement R's algorithm, we must specify a total ordering
within monomials of a fixed degree. R seems to use the name of the factor to
create this ordering.
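As a small illustration (an assumption consistent with the behaviour above, not a documented rule), monomials can be represented as sorted tuples of factor names and ordered first by degree and then lexicographically:

# Sort monomials by degree, breaking ties by factor name.
monomials = [('a', 'b', 'c'), ('b', 'c'), ('a',), ('a', 'b'), ()]
print(sorted(monomials, key=lambda m: (len(m), m)))
# [(), ('a',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]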
Given a sequence :math:`S` of monomials in the monoid ring, R first creates a
new sequence :math:`N`
of monomials of the same length as :math:`S` in the
monoid ring such that, for each :math:`i`,
the 'maximal elements' in :math:`S[:i]`
and :math:`N[:i]`
are identical. In turn, this means that the
columns of the design matrices generated by :math:`S[:i]`
and :math:`N[:i]`
span the same space. Therefore, the design matrices generated by :math:`S`
and :math:`N`
have the same column space.
It then uses this new sequence to ultimately construct the design
matrix. The span of the columns of the design matrix will always include the
span of the dummy encoding of this maximal monomial. Sometimes, it will yield a
full-rank design matrix, and sometimes not.
Given a sequence of monomials in the monoid ring (not necessarily sorted), R's algorithm is essentially:

N = []
for idx, monomial in enumerate(S):
    new_monomial = 1
    for variable in monomial:
        margin = monomial / variable          # the monomial with this variable removed
        if margin in generated(S[:idx]):      # margin already spanned by earlier terms
            new_monomial *= (variable - 1)    # code this variable with contrasts
        else:
            new_monomial *= variable          # code this variable with indicators
    N.append(new_monomial)

Here ``generated(S[:idx])`` is pseudocode for the set of monomials spanned by the earlier terms.
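One way to make ``generated`` concrete, consistent with the ``factor_codings`` implementation above, is to take the downward closure (the simplicial complex) of the monomials seen so far; since the closure is downward closed, a margin lies in it exactly when all of its faces do. A minimal sketch, under that assumption:

from itertools import combinations

def faces(monomial):
    # All subsets (faces) of a monomial, as sorted tuples.
    items = sorted(monomial)
    return {c for r in range(len(items) + 1) for c in combinations(items, r)}

def generated(monomials):
    # Downward closure of a collection of monomials.
    closure = set()
    for m in monomials:
        closure |= faces(m)
    return closure

# Mirrors the subset test in factor_codings:
print(faces(('b', 'c')) <= generated([('a',), ('a', 'b')]))   # False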
Formally, this ordering of the ring takes the expression

.. math::

   1 + a + a \cdot b + b \cdot c + a \cdot b \cdot c

and yields sorted sequences of orders 0 through 3, in this case. In R's version, the sequences are sequences rather than sets, i.e. the order matters. For our expression above, we have the following sequences:

S[0] = ['']
S[1] = ['a']
S[2] = ['ab', 'bc']
S[3] = ['abc']
# this gives a linear ordering
initial = S[0] + S[1] + S[2] + S[3]
The algorithm proceeds as (a string-based version of the pseudocode above, using the same ``generated``):

final = []
for i, p in enumerate(initial):
    q = ''
    for s in p:
        margin = p.replace(s, '')             # the monomial p with the factor s removed
        if margin in generated(initial[:i]):  # margin already spanned by earlier terms
            q += '(%s-1)' % s                 # code s with contrasts
        else:
            q += s                            # code s with indicator variables
    final.append(q)