Mathematical optimization: finding minima of functions

Authors: Gaël Varoquaux

Mathematical optimization deals with the problem of numerically finding minima (or maxima or zeros) of a function. In this context, the function is called the cost function, the objective function, or the energy.

Here, we are interested in using scipy.optimize for black-box optimization: we do not rely on the mathematical expression of the function that we are optimizing. Note that this expression can often be used for more efficient, non black-box, optimization.

Prerequisites

Knowing your problem#

Not all optimization problems are equal. Knowing your problem enables you to choose the right tool.

Dimensionality of the problem

The scale of an optimization problem is pretty much set by the dimensionality of the problem, i.e. the number of scalar variables on which the search is performed.

Convex versus non-convex optimization#

../../_images/1a2892655a216937ebc0d3812415391100a8e57df637c93ad1e24660bb134457.png

../../_images/59e1ddb3216ac365374da4dabf7d69a6d0533f41dbfa29a2cf75507cc001a83d.png

A convex function:

\(f\) is above all its tangents.
Equivalently, for two points \(A, B\), \(f(C)\) lies below the segment \([f(A), f(B)]\), if \(A < C < B\).

A non-convex function

Plot code

See convex, non-convex function plots.

Optimizing convex functions is easy. Optimizing non-convex functions can be very hard.

Note

It can be proven that for a convex function a local minimum is also a global minimum. Then, in some sense, the minimum is unique.
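The chord characterization of convexity can be probed numerically. The sketch below is only a spot check on a single segment (passing it does not prove convexity), and both test functions are hypothetical examples that are not taken from the plots above.

```python
import numpy as np


def violates_convexity(func, a, b, n_points=101):
    """Return True if func rises above the chord joining (a, f(a)) and (b, f(b))."""
    t = np.linspace(0.0, 1.0, n_points)
    c = (1 - t) * a + t * b                  # points between a and b
    chord = (1 - t) * func(a) + t * func(b)  # segment joining f(a) and f(b)
    return bool(np.any(func(c) > chord + 1e-12))


def convex(x):
    return x**2


def non_convex(x):
    # hypothetical wiggly function with several local minima
    return x**2 + 2 * np.sin(3 * x)


convex_ok = not violates_convexity(convex, -2.0, 2.0)
non_convex_ok = not violates_convexity(non_convex, -2.0, 2.0)
print(convex_ok, non_convex_ok)
```

A function that fails this check on any segment is certainly not convex, so a local optimizer started from different points may return different minima.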

Smooth and non-smooth problems#

../../_images/fd055c8e135893ea18f2d1fa388db29987922eb487a7fa68a85ea97c26a7df27.png

../../_images/3a078911244208b8f71b2d8a7ca413c0f02a062b4801f7d2e31b98f46a6629ba.png

A smooth function:

The gradient is defined everywhere, and is a continuous function

A non-smooth function

Plot code

See smooth, non-smooth function plots.

Optimizing smooth functions is easier (true in the context of black-box optimization, otherwise Linear Programming is an example of methods which deal very efficiently with piece-wise linear functions).

Noisy versus exact cost functions#

Noisy (blue) and non-noisy (orange) functions

../../_images/bb421f951ec5c85a13e3e802eee381b440e0ab26869da1a8f6822d5e98abf5b1.png

Plot code

See noisy, non-noisy function plots.

Noisy gradients

Many optimization methods rely on gradients of the objective function. If the gradient function is not given, the gradients are computed numerically, which induces errors. In such a situation, even if the objective function is not noisy, a gradient-based optimization may be a noisy optimization.
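To make the size of these errors concrete, here is a minimal sketch of a forward-difference derivative. The helper `forward_difference` and the choice of `np.sin` as test function are assumptions made for illustration; this is not how SciPy computes gradients internally.

```python
import numpy as np


def forward_difference(func, x, h):
    """Approximate the derivative of func at x with step h."""
    return (func(x + h) - func(x)) / h


x = 1.0
exact = np.cos(x)  # exact derivative of sin at x
errors = {
    h: abs(forward_difference(np.sin, x, h) - exact)
    for h in (1e-2, 1e-8, 1e-15)
}
for h, err in errors.items():
    print(f"h = {h:.0e}: error = {err:.2e}")
```

A large step suffers from truncation error and a tiny step from floating-point round-off, so the numerical gradient is never exact, whatever the step.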

Constraints#

Optimizations under constraints

Here:

\(-1 < x_1 < 1\)

\(-1 < x_2 < 1\)

../../_images/c6429c256bbe5d38fe59b7cb8ff82fac9ccd35e2cf286a23c77e28406a8acbd3.png

Plot code

See constraint plots.

A review of the different optimizers#

Getting started: 1D optimization#

Let’s get started by finding the minimum of the scalar function \(f(x) = -\exp[-(x-0.5)^2]\). scipy.optimize.minimize_scalar() uses Brent’s method to find the minimum of a function:

import numpy as np
import scipy as sp

def f(x):
    return -np.exp(-(x - 0.5)**2)

result = sp.optimize.minimize_scalar(f)
result.success # check if solver was successful

True

x_min = result.x
x_min

np.float64(0.5000000058670102)

x_min - 0.5

np.float64(5.867010210991452e-09)

Table 7 **Brent’s method on a quadratic function**: it converges in 3 iterations, as the quadratic approximation is then exact.#

Table 8 **Brent’s method on a non-convex function**: note that the fact that the optimizer avoided the local minimum is a matter of luck.#

Plot code

See Brent’s method figures.

Note

You can use different solvers using the parameter method.

Note

scipy.optimize.minimize_scalar() can also be used for optimization constrained to an interval using the parameter bounds.
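The two notes above can be sketched as follows. The interval `(0, 0.3)` is an arbitrary choice for illustration, picked so that the unconstrained minimum at 0.5 lies outside it.

```python
import numpy as np
from scipy import optimize


def f(x):
    return -np.exp(-(x - 0.5) ** 2)


# A different solver: golden-section search instead of Brent's method
golden = optimize.minimize_scalar(f, method="golden")

# Search restricted to the interval [0, 0.3]
result = optimize.minimize_scalar(f, bounds=(0, 0.3), method="bounded")
print(golden.x, result.x)
```

Since the function decreases over the whole interval, the bounded solver should stop close to the upper bound.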

Gradient based methods#

Some intuitions about gradient descent#

Here we focus on intuitions, not code. Code will follow.

Gradient descent basically consists in taking small steps in the direction opposite to the gradient, that is, the direction of steepest descent.

Table 9 **Fixed step gradient descent**#
A well-conditioned quadratic function.
An ill-conditioned quadratic function. The core problem of gradient-methods on ill-conditioned problems is that the gradient tends not to point in the direction of the minimum.

Plot code

See gradient descent plots.

We can see that very anisotropic (ill-conditioned) functions are harder to optimize.
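A toy fixed-step gradient descent illustrates the effect. This is a sketch under stated assumptions: the quadratic \(f(x) = \frac{1}{2}(x_0^2 + k x_1^2)\), the helper names and the step size \(1/k\) are chosen for illustration and are not the code behind the plots.

```python
import numpy as np


def gradient_descent(grad, x0, step, tol=1e-6, max_iter=100_000):
    """Fixed-step gradient descent; returns the final point and iteration count."""
    x = np.asarray(x0, dtype=float)
    for n_iter in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, n_iter
        x = x - step * g  # step against the gradient
    return x, max_iter


def make_quadratic_gradient(k):
    """Gradient of f(x) = 0.5 * (x[0]**2 + k * x[1]**2)."""
    scales = np.array([1.0, k])
    return lambda x: scales * x


x0 = [1.0, 1.0]
# Well-conditioned: both directions have the same curvature
_, n_well = gradient_descent(make_quadratic_gradient(1.0), x0, step=1.0)
# Ill-conditioned: the step is limited by the steepest direction
_, n_ill = gradient_descent(make_quadratic_gradient(100.0), x0, step=1.0 / 100.0)
print(n_well, n_ill)
```

For a stable iteration the step has to stay below 2 divided by the largest curvature, so the flat direction of the ill-conditioned problem is crossed in tiny steps.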

Take home message: conditioning number and preconditioning

If you know a natural scaling for your variables, prescale them so that they behave similarly. This is related to preconditioning.

Also, it clearly can be advantageous to take bigger steps. This is done in gradient descent code using a line search.

Table 10 **Adaptive step gradient descent**#
A well-conditioned quadratic function.
An ill-conditioned quadratic function.
An ill-conditioned non-quadratic function.
An ill-conditioned very non-quadratic function.

Plot code

See gradient descent plots.

The more a function looks like a quadratic function (elliptic iso-curves), the easier it is to optimize.

Conjugate gradient descent#

The gradient descent algorithms above are toys not to be used on real problems.

As the experiments above show, one problem with simple gradient descent is that it tends to oscillate across a valley: each step follows the direction of the gradient, which makes it cross the valley. The conjugate gradient method solves this problem by adding a friction term: each step depends on the last two values of the gradient, and sharp turns are reduced.
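The idea can be sketched in its simplest setting, the linear conjugate gradient for a quadratic \(\frac{1}{2} x^T A x - b^T x\). This is an illustrative implementation written for this example, simpler than the nonlinear variant needed for general functions, and the matrix `A` and vector `b` are hypothetical.

```python
import numpy as np


def conjugate_gradient(A, b, x0, tol=1e-8, max_iter=100):
    """Minimize 0.5 * x @ A @ x - b @ x for a symmetric positive definite A."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x  # residual, i.e. minus the gradient
    d = r.copy()   # first direction: steepest descent
    n_iter = 0
    while np.linalg.norm(r) > tol and n_iter < max_iter:
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # weight given to the previous direction
        d = r_new + beta * d              # new direction mixes gradient and history
        r = r_new
        n_iter += 1
    return x, n_iter


A = np.array([[1.0, 0.0], [0.0, 100.0]])  # ill-conditioned quadratic
b = np.array([1.0, 100.0])                # minimum at (1, 1)
x_cg, n_cg = conjugate_gradient(A, b, [2.0, -1.0])
print(x_cg, n_cg)
```

The term `beta * d` plays the role of the friction described above: the new direction keeps a memory of the previous one instead of following the raw gradient.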

Table 11 **Conjugate gradient descent**#
An ill-conditioned non-quadratic function.
An ill-conditioned very non-quadratic function.

Plot code

See gradient descent plots.

SciPy provides scipy.optimize.minimize() to find the minimum of scalar functions of one or more variables. The simple conjugate gradient method can be used by setting the parameter method to "CG":

def f(x):   # The rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

sp.optimize.minimize(f, [2, -1], method="CG")

 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 1.6503729082243953e-11
       x: [ 1.000e+00  1.000e+00]
     nit: 13
     jac: [-6.153e-06  2.538e-07]
    nfev: 81
    njev: 27

Gradient methods need the Jacobian (gradient) of the function. They can compute it numerically, but will perform better if you can pass them the gradient:

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2), 2*(x[1] - x[0]**2)))

sp.optimize.minimize(f, [2, 1], method="CG", jac=jacobian)

 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 2.957865890641887e-14
       x: [ 1.000e+00  1.000e+00]
     nit: 8
     jac: [ 7.183e-07 -2.990e-07]
    nfev: 16
    njev: 16

Note that the function has only been evaluated 16 times, compared to 81 without the gradient.

Newton and quasi-newton methods#

Newton methods: using the Hessian (2nd differential)#

Newton methods use a local quadratic approximation to compute the jump direction. For this purpose, they rely on the first two derivatives of the function: the gradient and the Hessian.
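A single Newton step can be written in a few lines. The sketch below uses a hypothetical ill-conditioned quadratic \(\frac{1}{2} x^T A x - b^T x\), for which the local quadratic model is the function itself, so one step lands on the minimum.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 100.0]])  # Hessian of the quadratic
b = np.array([1.0, 100.0])                # minimum at (1, 1)


def gradient(x):
    return A @ x - b


def hessian(x):
    return A  # constant for a quadratic


x_start = np.array([2.0, -1.0])
# Newton step: solve H * step = gradient rather than inverting H explicitly
x_newton = x_start - np.linalg.solve(hessian(x_start), gradient(x_start))
print(x_newton)
```

On a non-quadratic function the same step is only as good as the quadratic model, which is why practical implementations add a line search or a trust region.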

An ill-conditioned quadratic function:

Note that, as the quadratic approximation is exact, the Newton method is blazing fast

../../_images/de816d2c28f14554e0d604ed5098624e11eeaaf9c998cf3c6774f84b203c9f28.png

../../_images/c2d1322633e17fa195eb41a9547e87288ff2922fb93f242ccf723e9d713948db.png

An ill-conditioned non-quadratic function:

Here we are optimizing a Gaussian, which is always below its quadratic approximation. As a result, the Newton method overshoots and leads to oscillations.

../../_images/a128e4bc08bb44d31ca14f8c45952ee5ecc7c196bfa7a527fd1d71530f2c67b6.png

../../_images/9ceefd1f998c2f79f682149796c9e68e7cf242199f59db562cca998e6bbfb1da.png

An ill-conditioned very non-quadratic function:

../../_images/81e8e6c69272a3fd971574a959e1a3a928c0c9f3efb8be1c85fc7260e3ed3d35.png

../../_images/57a2ef2094cad17fe22fbfef9790f87f44fee9039653320cd86babf05442779c.png

Plot code

See gradient descent plots.

In SciPy, you can use the Newton method by setting method to Newton-CG in scipy.optimize.minimize(). Here, CG refers to the fact that an internal inversion of the Hessian is performed by conjugate gradient.

def f(x):   # The rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2), 2*(x[1] - x[0]**2)))

sp.optimize.minimize(f, [2,-1], method="Newton-CG", jac=jacobian)

 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 1.5601357400786612e-15
       x: [ 1.000e+00  1.000e+00]
     nit: 10
     jac: [ 1.058e-07 -7.483e-08]
    nfev: 11
    njev: 33
    nhev: 0

Note that compared to conjugate gradient (above), Newton’s method has required fewer function evaluations, but more gradient evaluations, as it uses them to approximate the Hessian. Let’s compute the Hessian and pass it to the algorithm:

def hessian(x): # Computed with sympy
    return np.array(((1 - 4*x[1] + 12*x[0]**2, -4*x[0]), (-4*x[0], 2)))

sp.optimize.minimize(f, [2,-1], method="Newton-CG", jac=jacobian, hess=hessian)

 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 1.6277298383706738e-15
       x: [ 1.000e+00  1.000e+00]
     nit: 10
     jac: [ 1.110e-07 -7.781e-08]
    nfev: 11
    njev: 11
    nhev: 10

Note

In very high dimensions (large scale, more than about 250 variables), the inversion of the Hessian can be costly and unstable.

Note

Newton optimizers should not be confused with Newton’s root finding method, scipy.optimize.newton(), which is based on the same principles.
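For comparison, here is a minimal use of the root finder. The function \(x^2 - 2\) and the starting point are hypothetical choices for illustration.

```python
from scipy import optimize


def func(x):
    return x**2 - 2


def fprime(x):
    return 2 * x


# Newton-Raphson iteration on func, started from x0 = 1
root = optimize.newton(func, x0=1.0, fprime=fprime)
print(root)
```

The link with optimization is that minimizing a smooth function amounts to finding a zero of its gradient, so a Newton optimizer applies this kind of iteration to the gradient, with the Hessian playing the role of `fprime`.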

Quasi-Newton methods: approximating the Hessian on the fly#

BFGS: BFGS (Broyden-Fletcher-Goldfarb-Shanno algorithm) refines at each step an approximation of the Hessian.

An ill-conditioned quadratic function:

On an exactly quadratic function, BFGS is not as fast as Newton’s method, but still very fast.

../../_images/02b52c2f67b0564389bd5da588d2e8b01c1c3f10d367cd4047cdda9a5a67b221.png

../../_images/7c5606ce6036e356f5ae7653aa92c43ed7039a23910752ff9381f44d335ecdbd.png

An ill-conditioned non-quadratic function:

Here BFGS does better than Newton, as its empirical estimate of the curvature is better than that given by the Hessian.

../../_images/20aa2a682e620700774c81ff572572effb5731d2a6f63020320d5d1464dd556e.png

../../_images/220a78d48aa1225558d5dd4534bfd7273b1e678f1062e6fe16ef86e3e8f0f31e.png

An ill-conditioned very non-quadratic function:

../../_images/1609b092c18f0dedcf4e6743a011120adbce171efa66ea681ff6db66796bd323.png

../../_images/1c80ba6fecc113cda313aa165b3a2b2639c5473740f4a88df826d5141f801371.png

Plot code

See gradient descent plots.

def f(x):   # The rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2), 2*(x[1] - x[0]**2)))

sp.optimize.minimize(f, [2, -1], method="BFGS", jac=jacobian)

  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 2.630637192365927e-16
        x: [ 1.000e+00  1.000e+00]
      nit: 8
      jac: [ 6.709e-08 -3.222e-08]
 hess_inv: [[ 9.999e-01  2.000e+00]
            [ 2.000e+00  4.499e+00]]
     nfev: 10
     njev: 10

L-BFGS: Limited-memory BFGS sits between BFGS and conjugate gradient: in very high dimensions (> 250) the Hessian matrix is too costly to compute and invert. L-BFGS keeps a low-rank version. In addition, box bounds are also supported by L-BFGS-B:

def f(x):   # The rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

def jacobian(x):
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2), 2*(x[1] - x[0]**2)))

sp.optimize.minimize(f, [2, 2], method="L-BFGS-B", jac=jacobian)

  message: CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL
  success: True
   status: 0
      fun: 1.441767747301186e-15
        x: [ 1.000e+00  1.000e+00]
      nit: 16
      jac: [ 1.023e-07 -2.593e-08]
     nfev: 17
     njev: 17
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>

Gradient-less methods#

A shooting method: the Powell algorithm#

Almost a gradient approach:

An ill-conditioned quadratic function:

Powell’s method isn’t too sensitive to local ill-conditioning in low dimensions.

../../_images/c7828b16759f7eca327f8fd6a1fb20ac57dd44b878be5dcdec7104152b41b353.png

../../_images/b4ff272d9c0b286e26d6b5060077e9283678f9e4f18d416dff74da3631129f28.png

An ill-conditioned very non-quadratic function:

../../_images/9a6b90c14dda968803a85a1c321a0e1d4cff114edfd82bd50a21e73aaed07c30.png

../../_images/3c88bf709f2d9a4ad48a7f766be16b9e74363d3cf3a88d40a743b5127054ae59.png

Plot code

See gradient descent plots.

Simplex method: the Nelder-Mead#

The Nelder-Mead algorithm is a generalization of dichotomy approaches to high-dimensional spaces. The algorithm works by refining a simplex, the generalization of intervals and triangles to high-dimensional spaces, to bracket the minimum.

Strong points: it is robust to noise, as it does not rely on computing gradients. Thus it can work on functions that are not locally smooth such as experimental data points, as long as they display a large-scale bell-shape behavior. However it is slower than gradient-based methods on smooth, non-noisy functions.
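The robustness to noise can be tried on a small example. This is a sketch under stated assumptions: the bowl-shaped function, the noise level of 0.01, the seed and the iteration cap are all hypothetical choices.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)


def noisy_f(x):
    # smooth bowl with its minimum at (1, 1), plus simulated measurement noise
    return np.sum((np.asarray(x) - 1) ** 2) + 0.01 * rng.normal()


result = optimize.minimize(
    noisy_f, [2, -1], method="Nelder-Mead", options={"maxiter": 200}
)
distance = np.linalg.norm(result.x - 1)
print(result.x, distance)
```

The solver may not report success, because the noise can prevent its tolerance test from being met. The returned point should nevertheless sit in the neighborhood of the true minimum, at a distance set by the noise level.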

An ill-conditioned non-quadratic function:
An ill-conditioned very non-quadratic function:

Plot code

See gradient descent plots.

Using the Nelder-Mead solver in scipy.optimize.minimize():

def f(x):   # The rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

sp.optimize.minimize(f, [2, -1], method="Nelder-Mead")

       message: Optimization terminated successfully.
       success: True
        status: 0
           fun: 1.11527915993744e-10
             x: [ 1.000e+00  1.000e+00]
           nit: 58
          nfev: 111
 final_simplex: (array([[ 1.000e+00,  1.000e+00],
                       [ 1.000e+00,  1.000e+00],
                       [ 1.000e+00,  1.000e+00]]), array([ 1.115e-10,  1.537e-10,  4.988e-10]))

Global optimizers#

If your problem does not admit a unique local minimum (which can be hard to test unless the function is convex), and you do not have prior information to initialize the optimization close to the solution, you may need a global optimizer.

Brute force: a grid search#

scipy.optimize.brute() evaluates the function on a given grid of parameters and returns the parameters corresponding to the minimum value. The parameters are specified with ranges given to numpy.mgrid. By default, 20 steps are taken in each direction:

def f(x):   # The rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2

sp.optimize.brute(f, ((-1, 2), (-1, 2)))

array([1.00001462, 1.00001547])
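The result above is more accurate than the grid spacing because scipy.optimize.brute() by default polishes the best grid point with a local optimizer. A sketch that separates the two stages, using the documented `Ns` and `finish` parameters:

```python
import numpy as np
from scipy import optimize


def f(x):
    return 0.5 * (1 - x[0]) ** 2 + (x[1] - x[0] ** 2) ** 2


ranges = ((-1, 2), (-1, 2))

# Pure grid search: return the best of the 20 x 20 grid points
x_grid = optimize.brute(f, ranges, Ns=20, finish=None)

# Default behavior: the best grid point is refined by a local optimizer
x_polished = optimize.brute(f, ranges, Ns=20)

print(x_grid, x_polished)
```

The cost of the grid stage grows as `Ns` to the power of the number of variables, which limits this approach to low-dimensional problems.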

Practical guide to optimization with SciPy#

Choosing a method#

All methods are exposed as the method argument of scipy.optimize.minimize().

../../_images/674f09f6af5118743671c26a6660d4b4a48448deda2cb439e2b5170848fb374b.png

Code for plot above

See compare optimizers.

Table 12 **Rules of thumb for choosing a method**#
Without knowledge of the gradient	In general, prefer BFGS or L-BFGS, even if you have to approximate the gradients numerically. These are also the defaults if you omit the parameter `method`, depending on whether the problem has constraints or bounds. On well-conditioned problems, Powell and Nelder-Mead, both gradient-free methods, work well in high dimensions, but they collapse on ill-conditioned problems.
With knowledge of the gradient	BFGS or L-BFGS. The computational overhead of BFGS is larger than that of L-BFGS, itself larger than that of conjugate gradient. On the other hand, BFGS usually needs fewer function evaluations than CG. Thus the conjugate gradient method is better than BFGS at optimizing computationally cheap functions.
With the Hessian	If you can compute the Hessian, prefer the Newton method (Newton-CG or TCG).
If you have noisy measurements	Use Nelder-Mead or Powell.

Making your optimizer faster#

Choose the right method (see above), and compute the gradient and Hessian analytically if you can.
Use preconditioning when possible.
Choose your initialization points wisely. For instance, if you are running many similar optimizations, warm-restart one with the results of another.
Relax the tolerance if you don’t need precision, using the parameter tol.
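The effect of the last point can be checked by counting function evaluations. A minimal sketch, where the value `tol=1e-2` is an arbitrary choice for illustration:

```python
from scipy import optimize


def f(x):
    return 0.5 * (1 - x[0]) ** 2 + (x[1] - x[0] ** 2) ** 2


# Same solver and starting point, with a relaxed and with the default tolerance
loose = optimize.minimize(f, [2, -1], method="BFGS", tol=1e-2)
default = optimize.minimize(f, [2, -1], method="BFGS")

print("relaxed tol:", loose.nfev, "evaluations, x =", loose.x)
print("default tol:", default.nfev, "evaluations, x =", default.x)
```

Both runs follow the same path, and the relaxed one simply stops earlier, trading accuracy for evaluations.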

Computing gradients#

Computing gradients, and even more so Hessians, is very tedious but worth the effort. Symbolic computation with SymPy may come in handy.

Warning

A very common source of optimization not converging well is human error in the computation of the gradient. You can use scipy.optimize.check_grad() to check that your gradient is correct. It returns the norm of the difference between the gradient given and a gradient computed numerically:

sp.optimize.check_grad(f, jacobian, [2, -1])

np.float64(2.384185791015625e-07)

See also scipy.optimize.approx_fprime() to find your errors.
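One possible way to use it is to compare the two gradients component by component. In the sketch below, `buggy_jacobian` is a hypothetical gradient with a deliberate sign error, written only to show how the faulty component stands out.

```python
import numpy as np
from scipy import optimize


def f(x):
    return 0.5 * (1 - x[0]) ** 2 + (x[1] - x[0] ** 2) ** 2


def buggy_jacobian(x):
    # deliberate sign error in the second component, for illustration
    return np.array(
        (-(1 - x[0]) - 4 * x[0] * (x[1] - x[0] ** 2), 2 * (x[1] + x[0] ** 2))
    )


x0 = np.array([2.0, -1.0])
numerical = optimize.approx_fprime(x0, f, 1e-8)    # finite-difference gradient
mismatch = np.abs(buggy_jacobian(x0) - numerical)  # per-component discrepancy
worst = int(np.argmax(mismatch))
print("mismatch per component:", mismatch, "-> check component", worst)
```

scipy.optimize.check_grad() only reports the overall norm of the discrepancy, while the per-component view points at the line of the gradient code to inspect.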

Synthetic exercises#

A simple (?) quadratic function

Exercise 55

Optimize the following function, using K[0] as a starting point:

rng = np.random.default_rng(27446968)
K = rng.normal(size=(100, 100))

def f(x):
    return np.sum((K @ (x - 1))**2) + np.sum(x**2)**2

Time your approach. Find the fastest approach. Why is BFGS not working well?

Solution to Exercise 55

Alternating optimization

The challenge here is that the Hessian of the problem is a very ill-conditioned matrix. This can easily be seen, as the Hessian of the first term is simply 2 * K.T @ K. Thus the conditioning of the problem can be judged from looking at the conditioning of K.
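The conditioning can be measured directly with numpy.linalg.cond(). A short sketch that rebuilds the same K as in the exercise:

```python
import numpy as np

rng = np.random.default_rng(27446968)
K = rng.normal(size=(100, 100))

cond_K = np.linalg.cond(K)            # ratio of extreme singular values of K
cond_H = np.linalg.cond(2 * K.T @ K)  # conditioning of the first term's Hessian
print(f"cond(K) = {cond_K:.3g}, cond(2 K^T K) = {cond_H:.3g}")
```

The condition number of 2 * K.T @ K is the square of that of K, so even a moderately ill-conditioned K yields a badly conditioned Hessian.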

import time

rng = np.random.default_rng(27446968)

K = rng.normal(size=(100, 100))

def f(x):
    return np.sum((K @ (x - 1)) ** 2) + np.sum(x**2) ** 2

def f_prime(x):
    return 2 * K.T @ K @ (x - 1) + 4 * np.sum(x**2) * x

def hessian(x):
    H = 2 * K.T @ K + 4 * 2 * x * x[:, np.newaxis]
    return H + 4 * np.eye(H.shape[0]) * np.sum(x**2)

Some pretty plotting

plt.figure()
Z = X, Y = np.mgrid[-1.5:1.5:100j, -1.1:1.1:100j]  # type: ignore[misc]
# Complete in the additional dimensions with zeros
Z = np.reshape(Z, (2, -1)).copy()
Z.resize((100, Z.shape[-1]))
Z = np.apply_along_axis(f, 0, Z)
Z = np.reshape(Z, X.shape)
plt.imshow(Z.T, cmap="gray_r", extent=(-1.5, 1.5, -1.1, 1.1), origin="lower")
plt.contour(X, Y, Z, cmap="gnuplot")

<matplotlib.contour.QuadContourSet at 0x110ba36b0>

../../_images/95d33d873481457f57a545ccfc008041d126600b97d8315694c2ba8f8b002c5e.png

A reference but slow solution:

t0 = time.time()
x_ref = sp.optimize.minimize(f, K[0], method="Powell").x
print(f"     Powell: time {time.time() - t0:.2f}s")
f_ref = f(x_ref)

     Powell: time 0.16s

Compare different approaches

t0 = time.time()
x_bfgs = sp.optimize.minimize(f, K[0], method="BFGS").x
print(
    f"       BFGS: time {time.time() - t0:.2f}s, x error {np.sqrt(np.sum((x_bfgs - x_ref) ** 2)):.2f}, f error {f(x_bfgs) - f_ref:.2f}"
)

t0 = time.time()
x_l_bfgs = sp.optimize.minimize(f, K[0], method="L-BFGS-B").x
print(
    f"     L-BFGS: time {time.time() - t0:.2f}s, x error {np.sqrt(np.sum((x_l_bfgs - x_ref) ** 2)):.2f}, f error {f(x_l_bfgs) - f_ref:.2f}"
)

       BFGS: time 0.71s, x error 0.02, f error -0.03
     L-BFGS: time 0.05s, x error 0.02, f error -0.03

t0 = time.time()
x_bfgs = sp.optimize.minimize(f, K[0], jac=f_prime, method="BFGS").x
print(
    f"  BFGS w f': time {time.time() - t0:.2f}s, x error {np.sqrt(np.sum((x_bfgs - x_ref) ** 2)):.2f}, f error {f(x_bfgs) - f_ref:.2f}"
)

t0 = time.time()
x_l_bfgs = sp.optimize.minimize(f, K[0], jac=f_prime, method="L-BFGS-B").x
print(
    f"L-BFGS w f': time {time.time() - t0:.2f}s, x error {np.sqrt(np.sum((x_l_bfgs - x_ref) ** 2)):.2f}, f error {f(x_l_bfgs) - f_ref:.2f}"
)

  BFGS w f': time 0.08s, x error 0.02, f error -0.03
L-BFGS w f': time 0.00s, x error 0.02, f error -0.03

t0 = time.time()
x_newton = sp.optimize.minimize(
    f, K[0], jac=f_prime, hess=hessian, method="Newton-CG"
).x
print(
    f"     Newton: time {time.time() - t0:.2f}s, x error {np.sqrt(np.sum((x_newton - x_ref) ** 2)):.2f}, f error {f(x_newton) - f_ref:.2f}"
)

     Newton: time 0.01s, x error 0.02, f error -0.03

A locally flat minimum

Exercise 56

Consider the function exp(-1/(.1*x**2 + y**2)). This function admits a minimum at (0, 0). Starting from an initialization at (1, 1), try to get within 1e-8 of this minimum point.

This exercise is hard because the function is very flat around the minimum (all its derivatives are zero). Thus gradient information is unreliable.
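How flat the function is can be seen by simply evaluating it. The sketch below takes the scaling coefficient as 0.01, the value used in the code of the solution (an assumption that does not change the conclusion), and the sample points are arbitrary.

```python
import numpy as np


def f(x):
    return np.exp(-1 / (0.01 * x[0] ** 2 + x[1] ** 2))


for point in ([1.0, 1.0], [0.1, 0.1], [0.03, 0.03]):
    print(point, f(point))

# Close to the origin the value underflows to exactly zero in double precision
underflows = bool(f([0.03, 0.03]) == 0.0)
print(underflows)
```

Once the function evaluates to exactly zero over a whole neighborhood, finite differences return a zero gradient there and carry no information on where the minimum is.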

Solution to Exercise 56

Finding a minimum in a flat neighborhood

The function admits a minimum in [0, 0]. The challenge is to get within 1e-7 of this minimum, starting at x0 = [1, 1].

The solution that we adopt here is to give up on using gradient or information based on local differences, and to rely on the Powell algorithm. With 162 function evaluations, we get to 1e-8 of the solution.

def f(x):
    return np.exp(-1 / (0.01 * x[0] ** 2 + x[1] ** 2))

A well-conditioned version of f:

def g(x):
    return f([10 * x[0], x[1]])

The gradient of g. We won’t use it here for the optimization.

def g_prime(x):
    r = np.sqrt(x[0] ** 2 + x[1] ** 2)
    return 2 / r**3 * g(x) * x / r

result = sp.optimize.minimize(g, [1, 1], method="Powell", tol=1e-10)
x_min = result.x
x_min

array([ 1.00165881e-08, -9.59695391e-09])

Some pretty plotting:

t = np.linspace(-1.1, 1.1, 100)
plt.plot(t, f([0, t]));

../../_images/ed70b1c4409b4ee2a7964e48f38457c8f45a264b9635064b8205baa7a31290eb.png

X, Y = np.mgrid[-1.5:1.5:100j, -1.1:1.1:100j]  # type: ignore[misc]
plt.imshow(f([X, Y]).T, cmap="gray_r", extent=(-1.5, 1.5, -1.1, 1.1), origin="lower")
plt.contour(X, Y, f([X, Y]), cmap="gnuplot")

# Plot the gradient
dX, dY = g_prime([0.1 * X[::5, ::5], Y[::5, ::5]])
# Adjust for our preconditioning
dX *= 0.1
plt.quiver(X[::5, ::5], Y[::5, ::5], dX, dY, color=".5")

# Plot our solution
plt.plot(x_min[0], x_min[1], "r+", markersize=15);

../../_images/99f2968bec2e5ff7dda0d00eb85c37fb68d088ad1f63cc8a87de01ee626d90bd.png

Special case: non-linear least-squares#

Minimizing the norm of a vector function#

Least-squares problems, that is, minimizing the norm of a vector function, have a specific structure that is exploited by the Levenberg–Marquardt algorithm implemented in scipy.optimize.leastsq().

Let’s try to minimize the norm of the following vector function:

def f(x):
    return np.arctan(x) - np.arctan(np.linspace(0, 1, len(x)))

x0 = np.zeros(10)
sp.optimize.leastsq(f, x0)

(array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
        0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ]),
 4)

This took 67 function evaluations (check it with ‘full_output=True’). What if we compute the norm ourselves and use a good generic optimizer (BFGS):

def g(x):
    return np.sum(f(x)**2)

result = sp.optimize.minimize(g, x0, method="BFGS")
result.fun

np.float64(2.694080970246647e-11)

BFGS needs more function calls, and gives a less precise result.
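The evaluation counts can be read off programmatically. With `full_output=True`, scipy.optimize.leastsq() returns an information dictionary whose `"nfev"` entry is the number of function calls. A sketch of the comparison:

```python
import numpy as np
from scipy import optimize


def f(x):
    return np.arctan(x) - np.arctan(np.linspace(0, 1, len(x)))


x0 = np.zeros(10)

# Levenberg-Marquardt on the vector function itself
x_lsq, cov_x, info, mesg, ier = optimize.leastsq(f, x0, full_output=True)

# Generic optimizer on the squared norm
result_bfgs = optimize.minimize(lambda x: np.sum(f(x) ** 2), x0, method="BFGS")

print("leastsq nfev:", info["nfev"], " BFGS nfev:", result_bfgs.nfev)
```

The exact counts may vary between SciPy versions, which is why none are quoted here.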

Note

leastsq is interesting compared to BFGS only if the dimensionality of the output vector is large, and larger than the number of parameters to optimize.

Warning

If the function is linear, this is a linear-algebra problem, and should be solved with scipy.linalg.lstsq().
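For instance, fitting a straight line is a linear problem in its two coefficients. A minimal sketch with hypothetical noise-free data:

```python
import numpy as np
from scipy import linalg

t = np.linspace(0, 1, 20)
y = 2.0 * t + 1.0  # hypothetical data lying exactly on a line

# One column per unknown: slope and intercept
design = np.column_stack([t, np.ones_like(t)])
coef, residues, rank, singular_values = linalg.lstsq(design, y)
print(coef)
```

No starting point and no iteration are involved: the solution comes from a direct matrix factorization.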

Curve fitting#

Least-squares problems often occur when fitting a non-linear function to data. While it is possible to construct our optimization problem ourselves, SciPy provides a helper function for this purpose: scipy.optimize.curve_fit():

def f(t, omega, phi):
    return np.cos(omega * t + phi)

x = np.linspace(0, 3, 50)
rng = np.random.default_rng(27446968)
y = f(x, 1.5, 1) + .1*rng.normal(size=50)

sp.optimize.curve_fit(f, x, y)

(array([1.48121235, 0.99992781]),
 array([[ 0.00033369, -0.00049995],
        [-0.00049995,  0.00109915]]))

../../_images/a33607d12467bf1b019292f9c40e9b686cfa58d27eb6b7805f41b2bb34a96d00.png
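The two arrays returned by scipy.optimize.curve_fit() are the fitted parameters and an estimate of their covariance. A sketch of how they are commonly unpacked, where the explicit `p0` simply restates the default starting point of all ones:

```python
import numpy as np
from scipy import optimize


def f(t, omega, phi):
    return np.cos(omega * t + phi)


x = np.linspace(0, 3, 50)
rng = np.random.default_rng(27446968)
y = f(x, 1.5, 1) + 0.1 * rng.normal(size=50)

popt, pcov = optimize.curve_fit(f, x, y, p0=[1.0, 1.0])
perr = np.sqrt(np.diag(pcov))  # one-standard-deviation uncertainties

for name, value, err in zip(("omega", "phi"), popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

The starting point `p0` matters for oscillating models, as a poor initial guess can leave the fit in a local minimum.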

Exercise 57

Do the same with omega = 3. What is the difficulty?

Optimization with constraints#

Box bounds#

Box bounds correspond to limiting each of the individual parameters of the optimization. Note that some problems that are not originally written as box bounds can be rewritten as such via change of variables. Both scipy.optimize.minimize_scalar() and scipy.optimize.minimize() support bound constraints with the parameter bounds:

def f(x):
    return np.sqrt((x[0] - 3)**2 + (x[1] - 2)**2)

sp.optimize.minimize(f, np.array([0, 0]), bounds=((-1.5, 1.5), (-1.5, 1.5)))

  message: CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL
  success: True
   status: 0
      fun: 1.5811388300841898
        x: [ 1.500e+00  1.500e+00]
      nit: 2
      jac: [-9.487e-01 -3.162e-01]
     nfev: 9
     njev: 3
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>

../../_images/c681c4f9fe1a67c579fe3484f9ffbcd3a83c45b22b2887594129afb695e3bf45.png

Plot code

See constraint plots.

General constraints#

Equality and inequality constraints specified as functions: \(f(x) = 0\) and \(g(x) < 0\).

`scipy.optimize.fmin_slsqp()` Sequential least square programming: equality and inequality constraints#

../../_images/5d16237db2767222f2e4609573cdd358272f75968c23ad230cadb5afa2b2d306.png

Plot code

See constraint non-bounds.

def f(x):
    return np.sqrt((x[0] - 3)**2 + (x[1] - 2)**2)

def constraint(x):
    return np.atleast_1d(1.5 - np.sum(np.abs(x)))

x0 = np.array([0, 0])
sp.optimize.minimize(f, x0, constraints={"fun": constraint, "type": "ineq"})

 message: Optimization terminated successfully
 success: True
  status: 0
     fun: 2.4748737350439685
       x: [ 1.250e+00  2.500e-01]
     nit: 5
     jac: [-7.071e-01 -7.071e-01]
    nfev: 15
    njev: 5
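In scipy.optimize.minimize(), a constraint of type "ineq" means that the constraint function must be non-negative at the solution, while "eq" means that it must be zero. Equality constraints are passed the same way as the inequality above. The sketch below uses a hypothetical smooth objective and a hypothetical constraint \(x_1 + x_2 = 1\).

```python
import numpy as np
from scipy import optimize


def f(x):
    return (x[0] - 3) ** 2 + (x[1] - 2) ** 2


def on_line(x):
    # equality constraint: must be zero at the solution
    return x[0] + x[1] - 1


result = optimize.minimize(
    f,
    np.array([0.0, 0.0]),
    method="SLSQP",
    constraints={"type": "eq", "fun": on_line},
)
print(result.x)
```

The solution is the point of the line closest to (3, 2), that is (1, 0).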

Warning

The above problem is known as the Lasso problem in statistics, and there exist very efficient solvers for it (for instance in scikit-learn). In general do not use generic solvers when specific ones exist.

Lagrange multipliers

If you are ready to do a bit of math, many constrained optimization problems can be converted to non-constrained optimization problems using a mathematical trick known as Lagrange multipliers.
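As a small worked illustration (the objective and the constraint are hypothetical, chosen only to keep the algebra short), consider minimizing \((x_1 - 3)^2 + (x_2 - 2)^2\) subject to \(x_1 + x_2 = 1\). Introducing a multiplier \(\lambda\) and setting all partial derivatives of the Lagrangian to zero gives:

```latex
\begin{aligned}
\mathcal{L}(x_1, x_2, \lambda) &= (x_1 - 3)^2 + (x_2 - 2)^2 + \lambda\,(x_1 + x_2 - 1) \\
\partial_{x_1}\mathcal{L} = 0 &\;\Rightarrow\; 2(x_1 - 3) + \lambda = 0 \\
\partial_{x_2}\mathcal{L} = 0 &\;\Rightarrow\; 2(x_2 - 2) + \lambda = 0 \\
\partial_{\lambda}\mathcal{L} = 0 &\;\Rightarrow\; x_1 + x_2 = 1 \\
&\;\Rightarrow\; x_1 = 1,\quad x_2 = 0,\quad \lambda = 4
\end{aligned}
```

The constrained problem in two variables has become an unconstrained system of three equations in three unknowns, which here is linear and can be solved by hand.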
