Formula
=======

I am going to try to formalize a version of *R*'s formula, distinguishing
between *numeric* and *factor*-like variables, with the distinction that, for
a numeric variable, :math:`x^2 \neq x`, unlike in *R*. It also will not have
a ``-`` operation.

Factors
~~~~~~~

A *factor* is a categorical variable, and the usual interpretation of

.. rcode::

   a = factor(levels=c(2,3,4))
   b = factor(levels=c(3,4))
   ~ a*b

is as the Cartesian product of the columns of indicator matrices for *a* and
*b*. But in *R* the formula has meaning even without knowing the levels of
*a* and *b*.

The expression ``a*b`` above has a multiplication in it. *R* uses two
different multiplications; I will only use one, and it corresponds to *R*'s
symbol ``:``. Formally, let :math:`{\cal F}=\{f_1, \dots, f_k\}` be a set of
factors.

Multiplication automatically suggests some *algebraic* rules for formulae.
We usually think of factors as obeying the rules

.. math::

   \begin{aligned}
   f_i \cdot f_i &= f_i \\
   f_i \cdot (f_j \cdot f_k) &= (f_i \cdot f_j) \cdot f_k \\
   f_i \cdot f_j &= f_j \cdot f_i \\
   \pmb{1} \cdot f_i &= f_i
   \end{aligned}

The rules correspond to idempotency, associativity and commutativity. If we
added an inverse to multiplication, then the set of factors would constitute
a group. However, there is no inverse to multiplication. Algebraically, this
means that the set of factors, combined with this operation and the
properties above, generates an algebraic object known as a `monoid
<https://en.wikipedia.org/wiki/Monoid>`_. Note that the :math:`\pmb{1}`
above is not the constant 1, but the identity in the monoid :math:`{\cal M}`
generated by :math:`{\cal F}` and the product ":math:`\cdot`".

Algebraically, this idempotent, commutative monoid :math:`{\cal M}` is
isomorphic to the `join-semilattice
<https://en.wikipedia.org/wiki/Semilattice>`_ of subsets of :math:`{\cal F}`
with join representing union. In this view, :math:`\pmb{1}` corresponds to
the empty set :math:`\emptyset`.

In *R*'s formulae there is also an addition operation, ``+``.
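These rules can be realized directly in code. Here is a minimal sketch (the
names are mine, not nipy's API), representing each element of the monoid as
a frozenset of factor names, with product given by union and
:math:`\pmb{1}` by the empty set:

```python
# A hedged sketch (not nipy's implementation): the idempotent, commutative
# monoid generated by a set of factors, realized as the join-semilattice
# of frozensets with product = union and identity = the empty set.

def m_mul(m1, m2):
    """Monoid product of two elements (subsets of factor names)."""
    return m1 | m2

ONE = frozenset()                    # the identity element 1
a, b, c = (frozenset({n}) for n in 'abc')

assert m_mul(a, a) == a                                   # idempotency
assert m_mul(a, b) == m_mul(b, a)                         # commutativity
assert m_mul(a, m_mul(b, c)) == m_mul(m_mul(a, b), c)     # associativity
assert m_mul(ONE, a) == a                                 # identity
```

The union operation satisfies all four rules for free, which is exactly the
content of the semilattice isomorphism above.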
A monoid does not have addition; it has only one operation, multiplication.
In order to introduce addition, we need to introduce something like an
*algebra*. If we had a group, the natural structure to consider would be a
group algebra or `group ring <https://en.wikipedia.org/wiki/Group_ring>`_.
The corresponding notion for a monoid is that of a `monoid ring
<https://en.wikipedia.org/wiki/Monoid_ring>`_, or a *monoid algebra* if the
ring used in constructing the monoid ring is commutative.

The monoid algebra is constructed from a monoid, which we already have, and
a commutative ring. For the classical ANOVA models (no numeric variables),
we can take this ring to be :math:`\mathbb{Z}`. Below, when we introduce
things like *numeric* vectors in *R* to our formula, we will change the ring
to be the ring of real-valued functions on some space.

The monoid algebra :math:`A` of our monoid :math:`{\cal M}` over
:math:`\mathbb{Z}` can be expressed in terms of sets of 2-tuples
:math:`(i, m): i \in \mathbb{Z}, m \in {\cal M}` with multiplication,
addition and subtraction defined as:

.. math::

   \begin{aligned}
   \{(i,m)\} \cdot \{(j,m')\} &= \{(i \cdot j, m \cdot m')\} \\
   \{(i,m)\} + \{(j,m')\} &= \{(i, m), (j, m')\}, \qquad m \neq m' \\
   \{(i,m)\} + \{(j,m)\} &= \{(i+j,m)\}
   \end{aligned}

If we introduce variables :math:`1_m`, then these rules can be written as

.. math::

   \begin{aligned}
   \left(i \cdot 1_m \right) \cdot \left(j \cdot 1_{m'} \right) &= \left(i \cdot j \right) 1_{m \cdot m'} \\
   &= \{(i \cdot j, m \cdot m')\} \\
   \{(i, m), (j, m')\} &= i \cdot 1_m + j \cdot 1_{m'} \\
   \{(i,m)\} + \{(j,m)\} &= i \cdot 1_m + j \cdot 1_m \\
   &= (i+j) 1_m
   \end{aligned}

As our monoid is idempotent and commutative, we arrive at the conclusion
that

.. math::

   \prod_{j \in J}\left(i_j, m_j^{a_j} \right) = \left(\prod_{j \in J} i_j, \prod_{j \in J} m_j \right), \qquad a=(a_j)_{j \in J} \in \mathbb{N}^J

This places *R*'s convention

.. rcode::

   a*a

into a meaningful algebraic setting. Actually, *R* does this for numeric
variables as well, unless one uses the *I* notation.

.. rcode::

   terms(formula(~x^2))
   terms(formula(~I(x^2)))

Equivalence relation
~~~~~~~~~~~~~~~~~~~~

Unfortunately, we cannot make sense of *R*'s convention that :math:`m+m=m`
in the monoid algebra. There are a few things wrong with the expression
:math:`m+m` if we take :math:`m` to be an element of the monoid
:math:`{\cal M}`. The most important is that :math:`m` is not an element of
the algebra :math:`A`, which is the structure where we have defined
addition. However, we can define an equivalence relation :math:`\sim` on
:math:`A` as follows

.. math::

   \left\{(i_r, m_r): r \in R \right\} \sim \left\{(i'_s, m'_s): s \in S \right\} \iff \left\{m_r: r \in R \right\} = \left\{m'_s: s \in S\right\}, \qquad \forall i_r, i'_s \neq 0

Let the map to equivalence classes be denoted by
:math:`\pi:A \rightarrow \{0,1\}^{{\cal M}}`. Then, *R*'s convention
:math:`m+m=m` is equivalent to

.. math::

   \left\{(i, m) \right\} \sim \left\{(j, m) \right\}, \qquad \forall i,j \neq 0 \in \mathbb{Z}

or

.. math::

   \pi\left(\left\{(i,m)\right\}\right) = \pi\left(\left\{(j,m)\right\} \right), \qquad \forall i, j \neq 0

The set of equivalence classes has a well-defined notion of addition. Given
two equivalence classes :math:`s_1, s_2 \in \{0,1\}^{\cal M}` we define

.. math::

   s_1 + s_2 = \pi \left(\pi^{-1}(s_1) + \pi^{-1}(s_2) \right)

where :math:`\pi^{-1}(s)` is a *generic* element of the equivalence class
:math:`s`. Since our algebra is an algebra over :math:`\mathbb{Z}`, we can
take the generic elements to have random integer coefficients, say in the
range [-1e6,1e6]. This means that this notion of the sum of equivalence
classes will break down occasionally, in the sense that, with probability
roughly 5e-7, :math:`\pi(\{(1,m)\}) + \pi(\{(1,m)\}) = 0`. In the
implementation of this in `nipy`_, the ring we take can be thought of as the
ring of `sympy`_ dummy symbols, so, up to memory errors,
:math:`\{m\} + \{m\}=\{m\}` always.
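The equivalence-class construction can be made concrete with a small sketch
(names like ``alg_add``, ``pi`` and ``generic`` are mine, not nipy's): an
element of :math:`A` is a dict mapping monomials to nonzero integer
coefficients, and :math:`\pi` keeps only the support.

```python
import random

# Hedged sketch: an element of the monoid algebra A over Z is a dict
# {monomial: nonzero integer coefficient}; a monomial is a frozenset of
# factor names.  pi() maps an element to its equivalence class: the set
# of monomials appearing with nonzero coefficient.

def alg_add(x, y):
    """Addition in A: add coefficients, dropping exact cancellations."""
    out = dict(x)
    for m, coef in y.items():
        out[m] = out.get(m, 0) + coef
    return {m: coef for m, coef in out.items() if coef != 0}

def pi(x):
    """Map to the equivalence class in {0,1}^M (the support)."""
    return frozenset(x)

def generic(s, rng):
    """A generic representative: random nonzero integer coefficients."""
    return {m: rng.choice([-1, 1]) * rng.randint(1, 10**6) for m in s}

def class_add(s1, s2, rng=random.Random(0)):
    """Addition of equivalence classes via generic representatives."""
    return pi(alg_add(generic(s1, rng), generic(s2, rng)))

m = frozenset({'a'})
# exact cancellation collapses a term, which is why the generic
# construction can (very rarely) break down:
assert alg_add({m: 3}, {m: -3}) == {}
# generically, however, R's convention m + m = m holds:
# class_add({m}, {m}) returns {m} except when the two random
# coefficients happen to cancel
```

The random coefficients play the role of the "generic element"
:math:`\pi^{-1}(s)` in the text; a symbolic ring, as in nipy, removes the
failure probability entirely.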
The set of equivalence classes :math:`\{0,1\}^{\cal M}` also has a
well-defined notion of product

.. math::

   s_1 \cdot s_2 = \pi \left(\pi^{-1}(s_1) \cdot \pi^{-1}(s_2) \right)

This, too, may occasionally break down, depending on how we define our
*generic* element :math:`\pi^{-1}(s)`. If we take the ring to be
:math:`\mathbb{R}` and generate a generic element by drawing from a random
variable with a continuous density, like the standard normal
:math:`N(0,1)`, then with probability 1 the rules of addition and
multiplication will not break down even if we perform countably many such
operations. It is not hard to see that the multiplication and addition are
associative, commutative and distributive on :math:`\{0,1\}^{\cal M}`.

While addition and multiplication are well-defined on the set of
equivalence classes, there is no corresponding notion of subtraction of
equivalence classes because, generically,

.. math::

   \pi(\pi^{-1}(s_1) - \pi^{-1}(s_2)) = \pi(\pi^{-1}(s_1) + \pi^{-1}(s_2))

Therefore, we cannot call the set of equivalence classes a ring or any
other familiar object. As our set of equivalence classes is just
:math:`\{0,1\}^{\cal M}`, we can delete the terms of :math:`s_d` that
appear in :math:`s` by the elementwise multiplication

.. math::

   s(m) \cdot (1 - s_d(m)), \qquad m \in {\cal M}

Or, regarding :math:`s, s_d` as subsets of :math:`{\cal M}`, deletion of
terms is just the operation :math:`s \cap s_d^c`.

We have now given a formal way of describing which sets of factors are to
be included in a regression model. The set of factors to be included is
specified by an element of :math:`\{0,1\}^{\cal M}`, where the monoid
:math:`{\cal M}` is the semilattice of subsets of :math:`{\cal F}`. On the
set of factor specifications :math:`\{0,1\}^{\cal M}`, we have defined two
operations that obey the commutative, distributive and associative
properties.
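Working purely at the level of equivalence classes, the generic operations
reduce (with probability one, over a continuous ring) to set operations. A
minimal sketch, with invented helper names:

```python
# Sketch: an element of {0,1}^M as a set of monomials (frozensets of
# factor names).  Generically, addition is union, multiplication is the
# pairwise monoid product, and deletion is set difference s - s_d.

def spec_add(s1, s2):
    return s1 | s2

def spec_mul(s1, s2):
    return {m1 | m2 for m1 in s1 for m2 in s2}

def spec_del(s, s_d):
    return s - s_d

one = frozenset()                      # the intercept term
a, b = frozenset({'a'}), frozenset({'b'})

assert spec_add({a}, {a}) == {a}       # R's convention m + m = m
# (1 + a)(1 + b) = 1 + a + b + ab, roughly R's ~ a*b with its intercept
full = spec_mul({one, a}, {one, b})
assert full == {one, a, b, a | b}
# deleting the intercept term, as in ~ a*b - 1
assert spec_del(full, {one}) == {a, b, a | b}
```

Here ``spec_del`` is exactly the :math:`s \cap s_d^c` operation of the
text.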
Numeric variables
~~~~~~~~~~~~~~~~~

Above, we described rules for specifying which factors are to be included
in the model. One can take the same approach for numeric variables, which
is essentially what *R* does. However, *R* ultimately treats numeric
variables and factors differently when it comes to building the design
matrix. For instance, in the first model below, subtracting 1 yields fitted
values orthogonal to the constant, while this is not true in the second
model.

.. rcode::

   x = rnorm(40)
   y = rnorm(40)
   f = factor(rep(c(1,2),20))
   r1 = resid(lm(y~f-1))
   print(sum(r1))
   r2 = resid(lm(y~x-1))
   print(sum(r2))

There is a perfectly good explanation for this if one uses *R*'s rules for
constructing a design matrix from an element of :math:`\{0,1\}^{\cal M}`,
as we will see below. However, it does mean that factors and numeric
variables are treated separately. One way to address this is to consider
the monoid

.. math::

   {\cal M}_N \times {\cal M}_F

where :math:`{\cal M}_F` is a monoid for *factors*, essentially the same
monoid we saw in the previous section, and :math:`{\cal M}_N` is a monoid
for *numeric* variables. The multiplication in this monoid is defined
componentwise

.. math::

   (m_{N,1}, m_{F,1}) \times (m_{N,2}, m_{F,2}) = (m_{N,1} \cdot m_{N,2}, m_{F,1} \cdot m_{F,2})

The only remaining issue is how to define multiplication in the monoid
:math:`{\cal M}_N`. *R* uses the same idempotency rules as for factors,
though it seems more natural to use true multiplication instead.

.. rcode::

   lm(y~x*x)$coef
   lm(y~x)$coef

This leaves room for some weird manipulations

.. rcode::

   lm(y~x*log(exp(x)))
   lm(y~x + I(x^2))

Using a symbolic system like `sympy`_ can be helpful in cases like this.
Before continuing, let us make a slight modification of the previous monoid
that recognizes the fact that these objects are meant to represent random
variables, that is, functions on some sample space :math:`\Omega`.
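The componentwise product can be sketched as follows (a toy model, not
nipy's representation); here, unlike *R*, the numeric part keeps true
exponents, so :math:`x \cdot x = x^2`, while the factor part stays
idempotent:

```python
# Sketch of the product monoid M_N x M_F: a term is a pair
# (numeric part, factor part).  The numeric part is a dict of exponents
# (true multiplication, so x*x = x**2, unlike R's idempotent rule); the
# factor part is a frozenset, idempotent as before.

def term_mul(t1, t2):
    (num1, fac1), (num2, fac2) = t1, t2
    num = dict(num1)
    for var, exp in num2.items():
        num[var] = num.get(var, 0) + exp     # exponents add
    return (num, fac1 | fac2)

x = ({'x': 1}, frozenset())
f = ({}, frozenset({'f'}))

assert term_mul(x, x) == ({'x': 2}, frozenset())       # x . x = x^2
assert term_mul(f, f) == ({}, frozenset({'f'}))        # f . f = f
assert term_mul(x, f) == ({'x': 1}, frozenset({'f'}))  # componentwise
```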
The process of constructing the design matrix is then just the evaluation
of these functions at some point :math:`\omega \in \Omega`. If we want to
think of a more concrete probability model, we can think of a *data.frame*
with :math:`n` rows as a point in :math:`\Omega^n`, and our probability
distribution might correspond to IID samples from some probability space
:math:`(\Omega, {\cal B}, \mathbb{P})`. Of course, constructing the design
matrix is completely separate from specifying the joint distribution on
:math:`\Omega^n`, but thinking of factors and numeric variables as
functions makes things more concrete.

In this new view, a *factor* from the previous section is now defined as a
function :math:`f_i :\Omega \rightarrow L_i` where :math:`L_i` is the set
of levels of the factor. We define a *numeric* variable as a real-valued
function on :math:`\Omega`. With this definition, :math:`{\cal F}` is a
(finite) collection of functions on :math:`\Omega`, each taking values in a
different space, and :math:`{\cal M}_F` is the same monoid as before.
Therefore, a member of the set of equivalence classes
:math:`\{0,1\}^{{\cal M}_F}` can be viewed as a set of functions

.. math::

   \left\{f_i:\Omega \rightarrow L_i \right\} \subset {\cal F}

We might then take :math:`{\cal M}_N={\mathbb{R}}^{\Omega}` to be the
monoid of real-valued functions on :math:`\Omega` with multiplication
defined in the obvious way

.. math::

   (f_1 \cdot f_2)(\omega) = f_1(\omega) \cdot f_2(\omega)

There is, however, a possible measure-theoretic complication to consider,
as we should properly consider equivalence classes on
:math:`\mathbb{R}^{\Omega}` defined by equality up to sets of probability 0
according to some probability space :math:`(\Omega, {\cal B}, \mathbb{P})`.
Let us therefore fix this probability space and take
:math:`{\cal M}_N = L^0(\Omega, {\cal B}, \mathbb{P})`, the collection of
random variables on :math:`(\Omega, {\cal B}, \mathbb{P})`.
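Concretely (a toy sketch with made-up data): a row of a *data.frame* plays
the role of :math:`\omega`, a factor maps :math:`\omega` to a level, a
numeric variable maps :math:`\omega` to a real number, and the product in
:math:`{\cal M}_N` is pointwise:

```python
# Toy sketch: rows of a data.frame are sample points omega; a factor is
# a function omega -> level, a numeric variable a function omega -> R.

rows = [{'f': 'hot', 'x': 1.5}, {'f': 'warm', 'x': -0.5}]

def f(omega):
    return omega['f']            # factor: Omega -> {'hot', 'warm'}

def x(omega):
    return omega['x']            # numeric: Omega -> R

def x_sq(omega):
    return x(omega) * x(omega)   # pointwise product in M_N

assert [f(w) for w in rows] == ['hot', 'warm']
assert [x_sq(w) for w in rows] == [2.25, 0.25]
```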
Carrying out the same construction of equivalence classes as for factors,
our model is represented as a member of
:math:`\{0,1\}^{L^0(\Omega, {\cal B}, \mathbb{P}) \times {\cal M}_F}`.
Viewing this as a subset, our model is

.. math::

   M = \left\{(f_i, m_i) \right\} \subset L^0(\Omega, {\cal B}, \mathbb{P}) \times {\cal M}_F

Hence, our model is a collection of tuples, each pairing a real-valued
random variable with a subset of the factors :math:`{\cal F}`, each of
which is also a function defined on :math:`\Omega`.

Design matrix construction
~~~~~~~~~~~~~~~~~~~~~~~~~~

Given a description of our model, i.e. a set of tuples

.. math::

   M = \left\{(f,m): f \in L^0(\Omega, {\cal B}, \mathbb{P}), m \in {\cal M}_F \right\}

we now must build a design matrix. Formally, this corresponds to
constructing a function :math:`g:\Omega \rightarrow \mathbb{R}^k` from
:math:`M`, and then evaluating this function at a point
:math:`(\omega_1, \dots, \omega_n) \in \Omega^n`, which yields a matrix

.. math::

   D = \begin{pmatrix}
   g(\omega_1) \\
   \vdots \\
   g(\omega_n)
   \end{pmatrix}_{n \times k}

Part of the construction is also to determine a well-defined set of names
for each column of :math:`D`.

Given a `linear, or total, ordering
<https://en.wikipedia.org/wiki/Total_order>`_ on the random variables we
see in :math:`M`, i.e.

.. math::

   \left\{f: \exists (f,s) \in M \right\}

the process of building this design matrix can be carried out in stages
based on the functions :math:`f`. Let the ordering be determined as
:math:`f_1 < f_2 < \dots < f_{\# M}`. For each :math:`m \in {\cal M}_F`, we
construct a function :math:`g_m:\Omega \rightarrow \mathbb{R}^{k(m)}`, and
the final function is determined by concatenation in the order previously
determined

.. math::

   g(\omega) = \begin{pmatrix} f_1 \cdot g_{m_1}(\omega) & f_2 \cdot g_{m_2}(\omega) & \dots & f_{\# M} \cdot g_{m_{\# M}}(\omega) \end{pmatrix}

where the multiplication above is elementwise.
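Evaluating :math:`g` row by row can be sketched like this (the helper names
``indicator`` and ``row`` are hypothetical, not nipy's API; the factor part
of each term is coded here with plain indicators):

```python
# Sketch: build one design-matrix row by evaluating, at each omega, the
# numeric function times an indicator coding of its factor part.

levels = {'f': ['hot', 'warm']}        # assumed level ordering

def indicator(factor_name, omega):
    """Indicator coding of a factor evaluated at one sample point."""
    return [1.0 if omega[factor_name] == lev else 0.0
            for lev in levels[factor_name]]

def row(model, omega):
    """model: ordered list of (numeric function, factor name or None)."""
    out = []
    for num, fac in model:
        cols = indicator(fac, omega) if fac else [1.0]
        out.extend(num(omega) * c for c in cols)
    return out

def one(omega):
    return 1.0

def x(omega):
    return omega['x']

model = [(one, None), (x, 'f')]        # an intercept plus x:f

D = [row(model, w) for w in [{'f': 'hot', 'x': 2.0},
                             {'f': 'warm', 'x': 3.0}]]
assert D == [[1.0, 2.0, 0.0], [1.0, 0.0, 3.0]]
```

The column count :math:`k` and the column names both fall out of the
ordering of the terms and of the factor levels, as discussed next.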
In *R*, this linear ordering seems to be based on the name of the variable
in the expression, with *1*, the intercept, coming before all others.

.. rcode::

   z = rnorm(40)
   lm(y~x+z)$coef
   b = x
   a = z
   lm(y~a+b)$coef
   lm(y~(x+z)*f)$coef
   lm(y~(a+b)*f)$coef

To construct the coding of the factors, *R* uses rules as described in
Hastie & Chambers (1988), meant to satisfy users' expectations in common
statistical settings. For instance, in the case of two factors with one
nested in the other, this rule produces the "expected result". In our
notation, constructing this matrix is equivalent to constructing a function
:math:`g_m:\Omega \rightarrow \mathbb{R}^{k(m)}`. Here, :math:`k(m)` is the
number of real-valued functions, i.e. random variables, specified by
:math:`m`, which may be viewed as a collection of subsets of
:math:`{\cal F}`.

The collection of subsets of :math:`{\cal F}` is naturally *graded*, i.e.
:math:`m` can be partially ordered in terms of the size of each subset in
:math:`m`. Let :math:`s_1 \leq s_2 \leq \dots \leq s_{\# m}` represent this
ordering, where each :math:`s_i = \{f_{j}\} \subset {\cal F}, i \geq 1`
and, possibly, :math:`s_1 = \emptyset`. Now, each :math:`s_i` will
contribute a certain number of random variables to :math:`g_m` that are
concatenated together. The number of random variables depends on how each
factor :math:`f_{j} \in s_i` is *coded*.

The *coding* of a factor :math:`f_j` is defined as a choice of a random
vector :math:`h_j:\Omega \rightarrow \mathbb{R}^{d(f_j)}`. The types of
coding fall into two groups: either *indicator* or *contrast*. If a factor
is coded with *indicator* variables, then :math:`d(f_j) = \# L_{j}`, the
number of levels of :math:`f_j`. If a factor is coded with *contrast*
variables, then :math:`d(f_j)=\# L_j-1`. For each factor :math:`f_j`, and a
linear ordering of
:math:`L_j = \{v_{1,j} < v_{2,j} < \dots < v_{\# L_j,j} \}`, we can define
:math:`\# L_j` binary random variables

.. math::

   b_{j,l}(\omega) = \begin{cases}
   1 & f_j(\omega) = v_{l,j} \\
   0 & \text{otherwise}
   \end{cases}

An *indicator* coding is any set of :math:`\# L_j` functions whose linear
span coincides with the linear span of
:math:`b_{j,l}, 1 \leq l \leq \# L_j`. Usually, these functions are just
taken to be the :math:`b_{j,l}`'s themselves. A *contrast* coding is a set
of :math:`\# L_j-1` independent linear combinations of the :math:`b_{j,l}`,
chosen in various ways, each with different interpretations.

.. rcode::

   t(contr.helmert(4))
   t(contr.poly(4))
   t(contr.sum(4))
   t(contr.treatment(4))
   t(contr.SAS(4))

For instance, *contr.sum* (which corresponds to *nipy*'s ``main_effect``)
uses the random variables

.. math::

   b_{j,l} - b_{j,\# L_j}, \qquad 1 \leq l \leq \# L_j - 1

The rule *R* uses to construct the codings for a given
:math:`m = s_1 \leq s_2 \leq \dots \leq s_{\# m}` produces a set of random
variables whose linear span is equivalent to the linear span of

.. math::

   \left\{ \prod_{f_j \in s_i} b_{j,l_j} : 1 \leq l_j \leq \# L_j, \ 1 \leq i \leq \# m \right\}

The way it produces these columns depends on a linear ordering of
:math:`{\cal F}`. Let :math:`f_1 < f_2 < \dots < f_{\# {\cal F}}` be this
linear ordering. This, in turn, determines a linear ordering on all subsets
of :math:`{\cal F}`. Note that this linear ordering corresponds to
representing sets as sorted tuples, so that :math:`m` is a sorted list of
sorted tuples. We therefore now assume that :math:`m` has been linearly
ordered as :math:`s_1 < s_2 < \dots < s_{\# m}`.

This next step is a bit of a mouthful, but here goes: for any
:math:`s \subset {\cal F}`, the linear span of the random variables

.. math::

   \left\{ \prod_{f_j \in s} b_{j,l_j} : 1 \leq l_j \leq \# L_j \right\}

is equivalent to the linear span of

.. math::

   \left\{ \prod_{f_j \in \tilde{s}} \left(b_{j,l_j} - b_{j,\# L_j}\right) : 1 \leq l_j \leq \# L_j - 1, \ \emptyset \subseteq \tilde{s} \subseteq s \right\}

In words, the linear space spanned by using the indicator coding for each
factor in :math:`s \subset {\cal F}` is equivalent to the column space
spanned by using the contrast coding for each
:math:`\tilde{s} \subseteq s \subset {\cal F}` and concatenating them all
together. Hence, the column space spanned by using the indicator coding for
each factor in both of :math:`s_1, s_2` is equivalent to using the contrast
coding for each factor in every subset :math:`\tilde{s} \subseteq s_1` or
:math:`\tilde{s} \subseteq s_2`. The collection of such subsets forms an
`abstract simplicial complex
<https://en.wikipedia.org/wiki/Abstract_simplicial_complex>`_ with maximal
simplices :math:`\{s_1, s_2\}`.

*R* effectively constructs the function :math:`g_m` sequentially, based on
stages :math:`g_m^{(i)}, 1 \leq i \leq \# m`, in such a way that, for each
:math:`1 \leq i \leq \# m`, the linear span of the coordinate functions of
:math:`g^{(i)}_m` is the same as the linear span of using all indicator
codings for each factor in each of the subsets :math:`[s_1, \dots, s_i]`.
By construction, then, the coordinate functions of the final function
:math:`g_m = g_m^{(\# m)}` span the same space as if it had used the
indicator codings for each factor in each of the
:math:`[s_1, \dots, s_{\# m}]`. The rule can be expressed in terms of
operations on simplicial complexes:

.. literalinclude:: ../formula/utils.py
   :pyobject: factor_codings

These types of expressions are used in *R* to construct design matrices.

.. rcode::

   c = factor(levels=c('warm', 'hot'))
   a + a:b + b:c + a:b:c

However, *R* does not meaningfully handle this product. In order to
construct a design matrix, *R* tries to come up with a smart
parameterization of all the columns represented in the formula above. Note
also that I overwrote *c*, one of *R*'s most fundamental objects, oops.
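The span equivalence can be checked numerically for a single three-level
factor (pure Python, invented example data): the constant together with the
two contrast columns reconstructs the indicator columns.

```python
# Numeric check of the span equivalence for one 3-level factor:
# {1, b1-b3, b2-b3} spans the same space as the indicators {b1, b2, b3}.

col = ['u', 'v', 'w', 'u']                    # a factor column, 3 levels
lev = ['u', 'v', 'w']

b = [[1.0 if c == l else 0.0 for c in col] for l in lev]   # indicators
ones = [1.0] * len(col)                                    # the constant
c1 = [b[0][i] - b[2][i] for i in range(len(col))]          # contrast 1
c2 = [b[1][i] - b[2][i] for i in range(len(col))]          # contrast 2

# reconstruct the last indicator from the contrast coding:
# 1 - c1 - c2 = b1+b2+b3 - (b1-b3) - (b2-b3) = 3*b3
b3 = [(ones[i] - c1[i] - c2[i]) / 3 for i in range(len(col))]
assert b3 == b[2]
# and then b1 = c1 + b3, b2 = c2 + b3, so the two spans coincide
```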
Here is an example where the product is made sense of and used in a linear
model.

.. rcode::

   a = factor(rep(c(1,1,1,1,2,2,2,2,3,3,3,3), 10))
   b = factor(rep(c(1,1,2,2,1,1,2,2,1,1,2,2),10))
   cc = factor(rep(c('hot','warm','hot','warm','hot','warm','hot','warm','hot','warm','hot','warm'),10))
   Y = rnorm(120)
   lm1 = lm(Y~a+b+cc+a:b:cc-1)
   print(lm1$coef)

At this point, the algebraic tools being used are slightly different. The
monoid structure is forgotten, and we turn to working with simple
polynomials. What *R* does is try to find an alternative, but equivalent,
expression for something like :math:`a \cdot b \cdot c`, because it
recognizes :math:`a \cdot b \cdot c` as *maximal* in the formula above,
even though :math:`a` as well as :math:`a \cdot b` and :math:`b \cdot c`
also appear. In this case, equivalence means equivalence in the additive
sense. That is, it constructs a second element of this monoid ring such
that the column span, when computed, is identical to that of
:math:`a \cdot b \cdot c`, which is the pairwise product of the dummy
indicator variables of :math:`a, b, c`.

The polynomials that *R* uses for this are generated by
:math:`\{a, b, c, a-\pmb{1}, b-\pmb{1}, c-\pmb{1}\}`. An expression of the
form :math:`a-\pmb{1}` indicates that *R* will use "contrasts" instead of
"dummy variables" to encode this factor in the expression. There are
several choices of *contrast* to use. If the contrast encoding involves
dropping a term, say the term corresponding to the last level of a factor,
then the expression

.. math::

   (a-1) \cdot b \cdot (c-1)

has columns

.. code-block:: python

   [a_2 * b_3 * c_hot, a_2 * b_4 * c_hot]

Note that this assumes an *ordering* of the levels of a factor. This is not
a problem, but it makes some difference in terms of which columns of the
design matrix are produced by *R*. For instance, if we changed "hot" to "S"
for "sweltering" and "warm" to "C" for "comfy" in the levels of *c*, then
the design matrix would have columns

.. code-block:: python

   [a_2 * b_3 * c_C, a_2 * b_4 * c_C]

Therefore, the columns that *R* generates depend on the order of the levels
of the factors and, as we will see, they also depend on the order of the
names of the factors.

.. rcode::

   dd = factor(rep(c('S','C','S','C','S','C','S','C','S','C','S','C'),10))
   lm2 = lm(Y~a+b+dd+a:b:dd-1)
   print(lm2$coef)
   print(sum((fitted(lm1)-fitted(lm2))^2))

The point of these code snippets is to emphasize that the columns created
by *R* depend on a total linear ordering of monomials whose names depend
both on the *factor names* and the *factor levels*. Algebraically, the
monomial

.. math::

   (a-1) \cdot b \cdot (c-1)

knows nothing about the *levels* of the factors, so it does not know how to
construct this design matrix. However, we can create a linear ordering of
the monomials given a linear ordering of :math:`{\cal F}`.

The algorithm *R* uses to construct the columns of the design matrix is
relatively straightforward, at least as described in Hastie & Chambers
(1992), once the linear ordering of the monomials in the monoid ring is
fixed, as well as the ordering of the levels of the factors. As the monoid
ring is `graded <https://en.wikipedia.org/wiki/Graded_ring>`_, there is a
natural ordering of monomials in terms of the cardinality of the subsets
they represent (remember, our monoid ring is algebraically isomorphic to
the semilattice of subsets of :math:`{\cal F}` with binary operation
:math:`\cup`). This is just to emphasize that, to implement *R*'s
algorithm, we must specify a total ordering within monomials of a fixed
degree. *R* seems to use the name of the factor to create this ordering.

Given a sequence :math:`S` of monomials in the monoid ring, *R* first
creates a new sequence :math:`N` of monomials of the same length as
:math:`S` such that for each :math:`i`, the "maximal elements" in
:math:`S[:i]` and :math:`N[:i]` are identical.
In turn, this means that the columns of the design matrix generated by
:math:`S[:i]` and :math:`N[:i]` are identical. Therefore, the design
matrices generated by :math:`S` and :math:`N` have the same column space.
It then uses this new sequence :math:`N` to ultimately construct the design
matrix. The span of the columns of the design matrix will always include
the span of the dummy encoding of this maximal monomial. Sometimes it will
yield a full-rank design matrix, and sometimes not.

Given a sequence of monomials in the monoid ring (not necessarily sorted),
*R*'s algorithm is essentially:

.. code-block:: python

   # pseudo-code: generated(S[:idx]) stands for the span of the columns
   # generated by the monomials seen so far
   N = []
   for idx, monomial in enumerate(S):
       new_monomial = 1
       for variable in monomial:
           margin = monomial / variable
           if margin in generated(S[:idx]):
               new_monomial *= (variable - 1)   # contrast coding
           else:
               new_monomial *= variable         # indicator coding
       N.append(new_monomial)

Formally, this ordering of the ring takes the expression

.. math::

   a + a \cdot b + b \cdot c + a \cdot b \cdot c

and yields sorted sequences of orders 0 through 3, in this case. In *R*'s
version, the sequences are sequences, rather than sets, i.e. the order
matters. For our expression above, we have the following sequences:

.. code-block:: python

   S[0] = ['']
   S[1] = ['a']
   S[2] = ['ab', 'bc']
   S[3] = ['abc']
   # this gives a linear ordering
   initial = S[0] + S[1] + S[2] + S[3]

The algorithm proceeds as

.. code-block:: python

   # pseudo-code: generated(initial[:i]) as above
   final = []
   for i, p in enumerate(initial):
       q = p
       for s in p:
           margin = p.replace(s, '')
           if margin in generated(initial[:i]):
               q = q.replace(s, '(%s-1)' % s)
       final.append(q)

.. R always enforces removal of a constant...

   > lm2

   Call:
   lm(formula = Y ~ a + b + c - 1)

   Coefficients:
       a1       a2       b2       c2
   -0.02231 -0.22612 -0.09531  0.09841
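The pseudo-code above can be made runnable under a strong simplification
(my assumption, not *R*'s actual span test): approximate membership in
``generated(S[:idx])`` by checking whether the margin is contained in some
earlier term. This is a sketch of the indicator-vs-contrast decision rule,
not *R*'s exact implementation.

```python
# Runnable sketch (simplified): decide, for each factor in each term,
# whether to use an indicator coding ("f") or a contrast coding ("f-1").
# A margin counts as already generated if it is a subset of an earlier
# term -- a simplification of the span membership test in the text.

def codings(terms):
    out = []
    for i, term in enumerate(terms):
        coding = []
        for f in sorted(term):
            margin = term - {f}
            if any(margin <= prev for prev in terms[:i]):
                coding.append(f + '-1')        # contrast coding
            else:
                coding.append(f)               # indicator coding
        out.append(coding)
    return out

# the running example: 1 + a + a:b + b:c + a:b:c
terms = [set(), {'a'}, {'a', 'b'}, {'b', 'c'}, {'a', 'b', 'c'}]
print(codings(terms))
# -> [[], ['a-1'], ['a', 'b-1'], ['b', 'c-1'], ['a-1', 'b', 'c-1']]
```

Note how, in the maximal term ``a:b:c``, the factors whose margins
(``b:c`` and ``a:b``) appear earlier are switched to contrasts, matching
the :math:`(a-1)\cdot b \cdot (c-1)` pattern discussed above.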