A formula is basically a sympy expression for the mean of something of the form:
mean = sum([Beta(e)*e for e in expr])
That is, the mean is a linear combination of sympy expressions, each multiplied by its own “Beta”. The elements of expr can be instances of Term (for a linear regression formula, they would all be instances of Term). In general, though, the expressions may involve other parameters (i.e. sympy.Symbol instances) that are not Terms.
The design matrix is made up of columns that are the derivatives of mean with respect to everything that is not a Term, evaluated at a recarray that has field names given by [str(t) for t in self.terms].
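The derivative rule above can be sketched with plain sympy and numpy. This is only an illustration under stated assumptions: ordinary sympy symbols stand in for the package's Term and Beta objects, the coefficient names b0..b3 and the helper evaluate are hypothetical, and the three data rows are copied from the supervisor table shown further down.

```python
import numpy as np
import sympy

# Plain sympy symbols stand in for Term objects (assumption for illustration).
X1, X3 = sympy.symbols("X1 X3")
exprs = [X1, X3, X1 * X3, sympy.Integer(1)]

# One hypothetical coefficient symbol per expression, standing in for Beta.
betas = sympy.symbols("b0:4")

# mean = sum(beta_i * e_i)
mean = sum(b * e for b, e in zip(betas, exprs))

# Each design column is d(mean)/d(beta_i); for a linear model this is e_i.
columns = [sympy.diff(mean, b) for b in betas]

# A small structured array whose field names match the symbol names.
data = np.array([(51.0, 39.0), (64.0, 54.0), (70.0, 69.0)],
                dtype=[("X1", "<f8"), ("X3", "<f8")])


def evaluate(expr):
    """Evaluate a sympy expression in X1, X3 on every row of data."""
    fn = sympy.lambdify((X1, X3), expr, "numpy")
    values = np.asarray(fn(data["X1"], data["X3"]), dtype=float)
    # Constant columns (the intercept) come back as scalars; broadcast them.
    return np.broadcast_to(values, data.shape)


design = np.column_stack([evaluate(c) for c in columns])
print(design)
```

For a purely linear formula the derivative with respect to each coefficient is just the expression it multiplies, which is why the columns come out as X1, X3, X1*X3 and a column of ones.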
For those familiar with R’s formula syntax, if we wanted a design matrix like the following:
> s.table = read.table("http://www-stat.stanford.edu/~jtaylo/courses/stats191/data/supervisor.table", header=T)
> d = model.matrix(lm(Y ~ X1*X3, s.table))
> d
(Intercept) X1 X3 X1:X3
1 1 51 39 1989
2 1 64 54 3456
3 1 70 69 4830
4 1 63 47 2961
5 1 78 66 5148
6 1 55 44 2420
7 1 67 56 3752
8 1 75 55 4125
9 1 82 67 5494
10 1 61 47 2867
11 1 53 58 3074
12 1 60 39 2340
13 1 62 42 2604
14 1 83 45 3735
15 1 77 72 5544
16 1 90 72 6480
17 1 85 69 5865
18 1 60 75 4500
19 1 70 57 3990
20 1 58 54 3132
21 1 40 34 1360
22 1 61 62 3782
23 1 66 50 3300
24 1 37 58 2146
25 1 54 48 2592
26 1 77 63 4851
27 1 75 74 5550
28 1 57 45 2565
29 1 85 71 6035
30 1 82 59 4838
attr(,"assign")
[1] 0 1 2 3
>
With the Formula, it looks like this:
First read the same data as above:
>>> from os.path import dirname, join as pjoin
>>> import numpy as np
>>> import formula
>>> fname = pjoin(dirname(formula.__file__), 'data', 'supervisor.table')
>>> r = np.recfromtxt(fname, names=True)
Define the formula
>>> from formula import terms, Formula
>>> X1, X3 = terms(('X1', 'X3'))
>>> f = Formula([X1, X3, X1*X3, 1])
>>> f.mean
_b0*X1 + _b1*X3 + _b2*X1*X3 + _b3
The 1 is the “intercept” term. I have explicitly not used R’s default of adding it to everything.
>>> f.design(r)
array([(51.0, 39.0, 1989.0, 1.0), (64.0, 54.0, 3456.0, 1.0),
(70.0, 69.0, 4830.0, 1.0), (63.0, 47.0, 2961.0, 1.0),
(78.0, 66.0, 5148.0, 1.0), (55.0, 44.0, 2420.0, 1.0),
(67.0, 56.0, 3752.0, 1.0), (75.0, 55.0, 4125.0, 1.0),
(82.0, 67.0, 5494.0, 1.0), (61.0, 47.0, 2867.0, 1.0),
(53.0, 58.0, 3074.0, 1.0), (60.0, 39.0, 2340.0, 1.0),
(62.0, 42.0, 2604.0, 1.0), (83.0, 45.0, 3735.0, 1.0),
(77.0, 72.0, 5544.0, 1.0), (90.0, 72.0, 6480.0, 1.0),
(85.0, 69.0, 5865.0, 1.0), (60.0, 75.0, 4500.0, 1.0),
(70.0, 57.0, 3990.0, 1.0), (58.0, 54.0, 3132.0, 1.0),
(40.0, 34.0, 1360.0, 1.0), (61.0, 62.0, 3782.0, 1.0),
(66.0, 50.0, 3300.0, 1.0), (37.0, 58.0, 2146.0, 1.0),
(54.0, 48.0, 2592.0, 1.0), (77.0, 63.0, 4851.0, 1.0),
(75.0, 74.0, 5550.0, 1.0), (57.0, 45.0, 2565.0, 1.0),
(85.0, 71.0, 6035.0, 1.0), (82.0, 59.0, 4838.0, 1.0)],
dtype=[('X1', '<f8'), ('X3', '<f8'), ('X1*X3', '<f8'), ('1', '<f8')])
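The design above is a structured array with one named field per column. If a plain 2-D float matrix is more convenient (for example, to pass to a linear algebra routine), one possible conversion with numpy alone is sketched below. The two rows are copied from the output above; the variable names are illustrative and this is not necessarily how the package itself does the conversion.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# Two rows copied from the design output above, with the same dtype.
design = np.array(
    [(51.0, 39.0, 1989.0, 1.0), (64.0, 54.0, 3456.0, 1.0)],
    dtype=[("X1", "<f8"), ("X3", "<f8"), ("X1*X3", "<f8"), ("1", "<f8")],
)

# Convert the structured array into an ordinary (n_rows, n_columns) float array.
matrix = structured_to_unstructured(design)

# The field names are still available to label the columns.
column_names = design.dtype.names
print(column_names)
print(matrix)
```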
A dummy symbol tied to a Term, term.
Methods
apart
args_cnc
as_base_exp
as_coeff_Mul
as_coeff_add
as_coeff_exponent
as_coeff_factors
as_coeff_mul
as_coeff_terms
as_coefficient
as_dummy
as_expr
as_independent
as_leading_term
as_numer_denom
as_ordered_factors
as_ordered_terms
as_poly
as_powers_dict
as_real_imag
as_terms
atoms
cancel
class_key
coeff
collect
combsimp
compare
compare_pretty
compute_leading_term
conjugate
could_extract_minus_sign
count
count_ops
diff
doit
dummy_eq
evalf
expand
extract_additively
extract_multiplicatively
factor
find
fromiter
getO
getn
has
integrate
invert
is_hypergeometric
is_polynomial
is_rational_function
iter_basic_args
leadterm
limit
lseries
match
matches
n
normal
nseries
nsimplify
powsimp
radsimp
ratsimp
refine
removeO
replace
rewrite
separate
series
simplify
sort_key
subs
together
trigsimp
A Formula is a model for a mean in a regression model.
It is often given by a sequence of sympy expressions, with the mean model being the sum of each term multiplied by a linear regression coefficient.
The expressions may depend on additional Symbol instances, giving a non-linear regression model.
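As a rough illustration of the non-linear case, the sketch below uses plain sympy symbols: t stands in for a Term, theta is an extra parameter, and b0 stands in for a regression coefficient. All three names and the exponential-decay model are assumptions chosen for the example. Differentiating the mean with respect to each non-Term symbol gives the expressions that would form the design columns.

```python
import sympy

t = sympy.Symbol("t")          # stands in for a Term (a data column)
theta = sympy.Symbol("theta")  # extra, non-Term parameter
b0 = sympy.Symbol("b0")        # stands in for a regression coefficient

# A mean that is non-linear in theta.
mean = b0 * sympy.exp(-theta * t)

# Differentiate with respect to every symbol that is not a Term.
d_b0 = sympy.diff(mean, b0)
d_theta = sympy.diff(mean, theta)

print(d_b0)
print(d_theta)
```

Unlike the linear case, the derivative with respect to theta still depends on b0 and theta, so evaluating the design needs parameter values as well as data.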
Methods
delete_terms
design
subs
Coefficients in the linear regression formula.
Construct the design matrix, and optional contrast matrices.
Parameters :
    input : np.recarray
    param : None or np.recarray
    return_float : bool, optional
    contrasts : None or dict, optional
The dtype of the design matrix of the Formula.
Expression for the mean, expressed as a linear combination of terms, each with dummy variables in front.
The parameters in the Formula.
Perform a sympy substitution on all terms in the Formula
Returns a new instance of the same class
Parameters :
    old : sympy.Basic
    new : sympy.Basic
Returns :
    newf : Formula
Examples
>>> import sympy
>>> from formula import terms, Formula
>>> s, t = terms('s, t')
>>> f, g = [sympy.Function(l) for l in 'fg']
>>> form = Formula([f(t),g(s)])
>>> newform = form.subs(g, sympy.Function('h'))
>>> newform.terms
array([f(t), h(s)], dtype=object)
>>> form.terms
array([f(t), g(s)], dtype=object)
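The same idea can be sketched with sympy alone: substitute in each expression and collect the results in a new list, leaving the original untouched. A plain Python list stands in for the Formula's terms here, and the names old_terms and new_terms are illustrative only.

```python
import sympy

s, t = sympy.symbols("s t")
f, g, h = [sympy.Function(name) for name in "fgh"]

# A plain list stands in for the terms of a Formula (assumption).
old_terms = [f(t), g(s)]

# Substitute g -> h in every expression, building a new list.
new_terms = [expr.subs(g, h) for expr in old_terms]

print(new_terms)
print(old_terms)
```

Because sympy expressions are immutable, subs always returns a new expression, which is what makes returning a fresh instance natural.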
Terms in the linear regression formula.
Return a Formula(np.unique(self.terms))
Is obj a Beta?
Is obj a Formula?