Additional Topics

Optimization

In this section we will study methods for finding maxima and minima of a function \(f\). Of course the first attempt will always be via calculus, that is, finding \(f'\) and solving \(f'(x)=0\). This, though, only works if the function is fairly simple and the zeros of \(f'\) can be found analytically.

There are two types of problems here that are closely related, namely optimization and root finding. Quite often one can be turned into the other:

  • if we want to maximize (or minimize) \(f\) and \(f\) is differentiable, then any interior maximum or minimum \(x_0\) satisfies \(f'(x_0)=0\), so we can find it by solving this equation

  • if we want to solve the equation \(f(x)=a\), then \(x_0\) is a solution if and only if the function \(g(x)=(f(x)-a)^2\) has a minimum (with value 0) at \(x_0\)

Also, maximizing the function \(f\) is equivalent to minimizing \(-f\) (or, if \(f\) is positive, \(1/f\)).
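As a quick illustration of this equivalence, here is a minimal sketch using R's built-in one-dimensional routine optimize (not something we will need later): maximizing \(f(x)=\sin(x)\) on \((0,\pi)\) directly, or minimizing \(-f\), gives the same answer.

f <- function(x) sin(x)
# maximize f directly ...
optimize(f, c(0, pi), maximum = TRUE)$maximum
# ... or minimize -f; both should return x close to pi/2
optimize(function(x) -f(x), c(0, pi))$minimum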

Numerical Optimization

If that is not the case we can try numerical methods. The most famous of them is the Newton-Raphson algorithm. It chooses a starting point \(x_0\) and then iteratively calculates \[ x_n = x_{n-1} - H^{-1} \nabla f \] where \(H\) is the Hessian matrix and \(\nabla f\) is the gradient of f, both evaluated at \(x_{n-1}\).
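In one dimension the update reduces to \(x_n = x_{n-1} - f'(x_{n-1})/f''(x_{n-1})\). As a minimal sketch (not part of the example below), here it is for \(f(x)=\sin(x)\), with the derivatives written out by hand:

# one-dimensional Newton-Raphson: maximize f(x) = sin(x)
# f'(x) = cos(x), f''(x) = -sin(x)
newx <- 1
repeat {
  oldx <- newx
  newx <- oldx - cos(oldx)/(-sin(oldx))   # Newton-Raphson step
  if(abs(newx - oldx) < 1e-6) break
}
round(newx, 4)   # should be close to pi/2 = 1.5708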

Example

say \(f(x,y)=\sin(x)+\sin(y)\), \(0<x,y<\pi\)

f <- function(x, y) sin(x)+sin(y)
x <- seq(0, pi, length=100)
y <- x
z <- matrix(0, 100, 100)
for(i in 1:100) z[i, ] <- f(x[i], y)
persp(x, y, z)

ij <- which(z == max(z), arr.ind = TRUE)[1, ]
round(c(x[ij[1]], y[ij[2]], z[ij[1], ij[2]]), 2)
## [1] 1.55 1.55 2.00

\[ \begin{aligned} &f(x,y)= \sin (x)+ \sin(y) \\ &\nabla_x = \frac{\partial f}{\partial x} (x,y)= \cos(x) \\ &\nabla_y =\frac{\partial f}{\partial y} (x,y)= \cos(y) \\ &H_{1,1} = \frac{\partial^2 f}{\partial x^2} (x,y)= -\sin(x) \\ &H_{1,2} = H_{2,1} =\frac{\partial^2 f}{\partial x \partial y} (x,y)= 0 \\ &H_{2,2} = \frac{\partial^2 f}{\partial y^2} (x,y)= -\sin(y) \\ \end{aligned} \]

newp <- c(1, 1)   # starting point
repeat {
  oldp <- newp
  grad <- cbind(cos(oldp))   # gradient at oldp, as a column vector
  H <- matrix(c(-sin(oldp[1]), 0, 0, -sin(oldp[2])), 2, 2)   # Hessian at oldp
  newp <- oldp - solve(H)%*%grad   # Newton-Raphson update
  print(round(c(newp, f(newp[1], newp[2])), 4))
  if(sum(abs(oldp-newp))<0.0001) break   # stop when the step is very small
}
## [1] 1.642 1.642 1.995
## [1] 1.571 1.571 2.000
## [1] 1.571 1.571 2.000
## [1] 1.571 1.571 2.000
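As a quick cross-check (a sketch, not part of the routine above) we can also use R's general-purpose optimizer optim; it minimizes by default, so we hand it \(-f\):

# cross-check with optim; it minimizes, so we use -f
optim(c(1, 1), function(p) -(sin(p[1]) + sin(p[2])), method = "BFGS")$par
# both coordinates should again be close to pi/2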

One issue with the Newton-Raphson algorithm is that if the function has many local minima and/or maxima it is difficult to get it to converge to a global one.

Direct Simulation

Here we randomly pick points in some area, evaluate the function at each of them and keep the point with the largest function value.

Example

\(f(x,y)= \sin (x)+\sin (y)\) \(0<x,y<\pi\)

x <- runif(100, 0, pi)
y <- runif(100, 0, pi)
z <- matrix(0, 100, 100)
for(i in 1:100) z[i, ] <- f(x[i], y)
ij <- which(z == max(z), arr.ind = TRUE)[1, ]
round(c(x[ij[1]], y[ij[2]], z[ij[1], ij[2]]), 4)
## [1] 2.9567 0.6282 1.9999

Example

consider the function

\[ f(x)=[ \cos (50x) + \sin (20x)]^2\text{, }x \in [0,1] \]

f <- function(x) (cos(50*x) + sin(20*x))^2
curve(f, 0, 1, n=500)

we see it has many maxima and minima. Here Newton-Raphson is almost certain to fail, because the starting point would have to be very close to the global maximum already.
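One way to see this, as a rough sketch: start a standard gradient-based optimizer (here optim with method L-BFGS-B, restricted to [0, 1]) at a random point; it typically returns whatever local maximum happens to be closest to the starting point.

# sketch: a gradient-based optimizer started at random usually only finds
# the local maximum nearest the starting point
start <- runif(1)
optim(start, function(x) -f(x), method = "L-BFGS-B",
      lower = 0, upper = 1)$par

Direct simulation, on the other hand, does not depend on a starting point: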

B <- 1e4
x <- runif(B)
y <- f(x)
plot(x, y, pch = ".")

round(c(x[y == max(y)], max(y)), 2)
## [1] 0.38 3.83

This does work, but we need a large number of uniforms because the peak at the maximum is very narrow.

Here is another idea:

Because f is a continuous, non-negative function on a finite interval there exists a constant c such that \(c \cdot f\) is a density.

Moreover using the Hastings-Metropolis algorithm we don’t even need to know c. 

One problem is how to extract the maximum from the generated data. We can use a histogram to estimate the density and pick the location of its highest bar. In general a non-parametric density estimator would be better and would need far fewer points.

B <- 1e4
Xn <- rep(0, B)
Xn[1] <- 0.5
for (i in 2:B) {
  X <- runif(1)   # proposal: uniform on [0, 1] (independence sampler)
  if (runif(1) < f(X)/f(Xn[i-1]))   # Metropolis acceptance probability
    Xn[i] <- X
  else Xn[i] <- Xn[i-1]
}
hist(Xn[1000:B], breaks = 100, freq = FALSE, main="")
curve(f, 0, 1, add=TRUE)

a <- hist(Xn[1000:B], plot = FALSE)
a$breaks[which.max(a$counts)]
## [1] 0.35
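As mentioned above, a non-parametric density estimate is a smoother alternative to the histogram; here is a minimal sketch using R's kernel density estimator density:

# non-parametric alternative: take the mode of a kernel density estimate
d <- density(Xn[1000:B])
round(d$x[which.max(d$y)], 3)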

Example

Consider the function

\[ \begin{aligned} f(x,y)=&\\ &(x \sin(20y)+y \sin(20x))^2 \arccos( \sin(10x)x)+\\ &(x \cos(10y)-y \sin(10x))^2 \arccos( \cos(20y)y) \\ &-1 < x,y <1 \end{aligned} \]

f <- function(x, y) 
    (x*sin(20*y) + y*sin(20*x))^2 * acos(sin(10*x)*x) + 
    (x*cos(10*y) - y*sin(10*x))^2 * acos(cos(20*y)*y)
n <- 250
x <- seq(-1, 1, length = n)
y <- x
z <- matrix(0, n, n)
for (i in 1:n) z[i, ] <- f(x[i], y)
persp(x, y, z, theta = 100)

using the simple simulation approach is easy:

B <- 1e4
x <- runif(B, -1, 1)
y <- runif(B, -1, 1)
z <- f(x, y)
I <- c(1:B)[which.max(z)]
round(c(x[I], y[I], z[I]), 2)
## [1] -0.75  1.00  4.24

The solution via the histogram/non-parametric estimate again is doable but needs a bit of work.

In some ways a direct simulation approach is a bit wasteful because it is designed to generate points covering the whole support of the density. We, however, are only interested in the areas where the density is high, that is, near the maximum. The next algorithm accomplishes this:

Simulated Annealing

This algorithm was actually also introduced by Metropolis, in the same 1953 paper in which the original (“Metropolis”) version of the Metropolis-Hastings algorithm first appeared. The fundamental idea is that a change of scale, called the temperature, allows for faster moves on the surface of the function f to maximize, whose negative is called the energy.

Therefore rescaling partially avoids the trapping of the algorithm in local minima/maxima. Given a temperature parameter \(T>0\), a sample of

\(\theta_1(T)\), \(\theta_2(T)\), ...

is generated from the distribution \[ \pi( \theta)=c \cdot \exp(f(\theta)/T) \] Notice that

\[ \begin{aligned} &\pi'( \theta)=c \cdot \exp(f(\theta)/T)f'(\theta)/T=0\\ &\text{iff}\\ &f'(\theta)=0 \end{aligned} \]

so \(\pi (\theta)\) has a maximum iff \(f( \theta)\) has a maximum.

Moreover, even if \(\int f(x)dx=\infty\), \(\int \exp(f(x)/T)dx\) is often finite and so there exists a constant c which makes \(\pi\) a density.
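To see what the temperature does, here is a small sketch re-using the one-dimensional function from the earlier example: for a small T the target \(\exp(f(x)/T)\) is sharply peaked around the maxima, while for a large T it is nearly flat, which makes it easy for the chain to move around.

# effect of the temperature on the target exp(f(x)/T)
f <- function(x) (cos(50*x) + sin(20*x))^2
par(mfrow = c(1, 2))
curve(exp(f(x)/1), 0, 1, n = 500, main = "T = 1", ylab = "")
curve(exp(f(x)/0.2), 0, 1, n = 500, main = "T = 0.2", ylab = "")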

Here is one popular version of the simulated annealing algorithm:

  1. simulate \(Y\) from a distribution with the same support as \(f\), say with density \(g(|y-\theta_i|)\)

  2. accept \(\theta_{i+1}=Y\) with probability \[ p=\min\left\{\exp[(f(Y)-f(\theta_i))/T_i], 1 \right\} \] take \(\theta_{i+1}=\theta_i\) otherwise

  3. update \(T_i\) to \(T_{i+1}\)

Notice the similarities between this algorithm and the Hastings-Metropolis one: in each case we draw an observation from a “proposal distribution” which depends on the current state, and accept it as the next state with a certain probability.

Example

\(f(x)=[\cos(50x)+\sin(20x)]^2\) on [0,1]. For this, one implementation of the algorithm is as follows:

at iteration i the algorithm is at \((x^i, f(x^i))\)

  1. simulate \(U \sim U[a^i, b^i]\) where \(a^i=\max(x^i-0.5, 0)\) and \(b^i=\min(x^i+0.5, 1)\)

  2. accept \(x^{i+1}=U\) with probability

\(p^i=\min\left\{\exp[(f(U)-f(x^i))/T_i], 1\right\}\)

otherwise set \(x^{i+1}=x^i\)

  3. set \(T^{i+1}=1/\log(i+1)\)

opt.ex1 <- function(Show=FALSE) {
  f <- function(x) (cos(50*x) + sin(20*x))^2   
  B <- 1e4
  x <- rep(0, B)
  x[1] <- runif(1)       # random starting point
  y <- rep(f(x[1]), B)
  M <- c(x[1], y[1])     # best point found so far
  if(Show) plot(c(0, 1), c(0, 5), 
     type = "n", xlab = "x", ylab = "f")
  for (i in 2:B) {
    # step 1: propose U uniformly in a window of half-width 0.5 around x[i-1]
    U <- runif(1, max(x[i-1]-0.5, 0), min(x[i-1]+0.5, 1))
    y[i] <- f(U)
    # step 2: acceptance probability, with temperature T_i = 1/log(i+1)
    p <- min(exp((y[i] - y[i-1])* log(i+1)), 1)
    if(runif(1) < p) 
     x[i] <- U
    else {
      x[i] <- x[i-1]
      y[i] <- y[i-1]
    }
    if(y[i]>M[2]) 
      M <- c(x[i], y[i])
    if(Show)  segments(x[i-1], y[i-1], x[i], y[i])
    }
  M
}
opt.ex1(TRUE)

## [1] 0.5633 3.8325

It is often a good idea to run the routine a number of times, from different (maybe randomly chosen) starting points:

out <- matrix(0, 10, 2)
for(i in 1:10) out[i, ] <- opt.ex1()
out[order(out[, 2]), ]
##         [,1]  [,2]
##  [1,] 0.5633 3.833
##  [2,] 0.5633 3.833
##  [3,] 0.3791 3.833
##  [4,] 0.3792 3.833
##  [5,] 0.5633 3.833
##  [6,] 0.3791 3.833
##  [7,] 0.3791 3.833
##  [8,] 0.5633 3.833
##  [9,] 0.5633 3.833
## [10,] 0.5633 3.833
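R's optim also has a built-in simulated annealing method ("SANN"). As a rough cross-check (a sketch; note that SANN works on the whole real line, so the reported x may fall outside [0, 1]):

# rough cross-check with R's built-in simulated annealing (method "SANN");
# control = list(fnscale = -1) turns the minimizer into a maximizer
f <- function(x) (cos(50*x) + sin(20*x))^2
res <- optim(runif(1), f, method = "SANN",
             control = list(fnscale = -1, maxit = 2e4))
round(c(res$par, res$value), 4)   # the value should be close to 3.83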

Example

\[ \begin{aligned} f(x,y)=&\\ &(x \sin(20y)+y \sin(20x))^2 \arccos( \sin(10x)x)+\\ &(x \cos(10y)-y \sin(10x))^2 \arccos( \cos(20y)y) \\ &-1 < x,y <1 \end{aligned} \]

f <- function(x, y) 
    (x*sin(20*y) + y*sin(20*x))^2 * acos(sin(10*x)*x) + 
    (x*cos(10*y) - y*sin(10*x))^2 * acos(cos(20*y)*y)
f_opt <- function(start, eps = 0.025, B = 1e4) { 
  x <- rep(start[1], B)
  y <- rep(start[2], B)
  fun <- rep(f(x[1], y[1]), B)
  M <- c(x[1], y[1], fun[1])   # best point found so far
  plot(c(-1, 1), c(-1, 1), type = "n", xlab = "x", ylab = "f")
  for (i in 2:B) {
    # proposal: uniform window of half-width 0.1 in each coordinate
    U1 <- runif(1, max(x[i-1]-0.1, -1), min(x[i-1]+0.1, 1))
    U2 <- runif(1, max(y[i-1]-0.1, -1), min(y[i-1]+0.1, 1))
    fun[i] <- f(U1, U2)
    # acceptance probability; eps rescales the temperature
    p <- min(exp((fun[i]-fun[i-1])*eps*log(i+1)), 1)
    if(runif(1)<p) {x[i] <- U1; y[i] <- U2}
    else {
      x[i] <- x[i-1]
      y[i] <- y[i-1]
      fun[i] <- fun[i-1]
    }
    if(fun[i]>M[3]) 
      M <- c(x[i], y[i], fun[i])
    segments(x[i-1], y[i-1], x[i], y[i])
  }
  M
}
par(mfrow=c(3, 2))
f_opt(c(-0.5, -0.5))
## [1]  0.9882 -0.9593  7.6673
f_opt(c(-0.5, 0.5))
## [1] 0.9972 0.6921 7.0506
f_opt(c(0.5, -0.5))
## [1] -0.9934 -0.9988  9.8849
f_opt(c(0.5, 0.5))
## [1] -0.9946  0.7005  7.0661
f_opt(c(-0.99, -0.95), eps=0.1)
## [1] -1.0000 -0.9955 10.9024
f_opt(c(-0.5, 0.5), eps=0.1)

## [1] 0.9922 0.7150 6.7770

Different values of eps rescale the temperature: in the acceptance probability \(1/T_i\) is replaced by eps\(\cdot\log(i+1)\), so larger values of eps correspond to lower temperatures.