notes on continuous optimisation (class 10/2):
(mostly based on James Spall, Introduction to Stochastic Search and Optimization, Wiley, 2003)

---

motivating examples:
- parameter optimisation
- optimisation of engineering designs (e.g., beam dimensions in bridge design)
- protein structure prediction

---

general continuous optimisation problem:
solution components = real numbers; s = vector in R^n; g = n-dimensional continuous function R^n -> R

fundamental issues:
- discretisation (static, adaptive)
- dealing with continuous neighbourhoods (sampling from continuous probability distributions)
- gradient vs. gradient-free methods
- convergence to local vs. global optima

---

discretisation:
- Hutter et al. (ParamILS)
- Schulze-Kremer (GAs for Protein Folding)

---

dealing with continuous search neighbourhoods:

1. classical numerical optimisation methods:

- steepest descent (corresponds to backprop for NNs)
  from given candidate solution s, move along the direction of steepest descent
  step: s_{k+1} := s_k - a_k * g'(s_k)
  a_k = step size (aka gain, learning coefficient/rate); can be a constant, a decaying sequence, or determined using line search, i.e., by solving the secondary optimisation problem of determining a_k \in argmin_{a >= 0} g(s_k - a * g'(s_k))
  g = evaluation function; g' = dg/ds = gradient of g
  note:
  - convergence to local optima only; sensitive to transformation and scaling of g
  - still widely used, provides the basis for many advanced methods
  - g' can be difficult to obtain (or approximate)

- Newton-Raphson algorithm (Newton's method):
  idea: step size determined by 'local curvature' of g
  step: s_{k+1} := s_k - g''(s_k)^-1 * g'(s_k)
  g'' = d^2 g / (ds ds^T) = Hessian matrix of g
  note:
  - convergence to local optima only, but typically faster than steepest descent
  - exact solution for quadratic functions (convergence to s^* in one step), but this is uncommon in practice
  - transformation-invariant, unaffected by scaling of g
  - typically good behaviour close to s^*, poor behaviour (stalling, divergence) away from s^* -> using an additional scaling coefficient a_k for the Hessian matrix can help stabilise
  - g'' can be difficult to obtain (or approximate)

  (a minimal steepest descent / Newton example is sketched after section 2 below)

--

2. direct random search:
idea: use only information on g (not g', g'')

- simple random search (algorithm A from Spall, ch. 2):
  - choose s_0 deterministically or uniformly at random
  - generate s' based on s_k by sampling from probability distribution D(s, s_k) = continuous neighbourhood
  - if g(s') < g(s_k), s_{k+1} := s', else s_{k+1} := s_k
  - termination as usual
  D = uniform distribution over S -> uniform random picking
  D = multivariate Gaussian -> analogue of uniform random walk (with probability decaying exponentially with step size)
  ...

- equivalently: localised random search (algorithm B from Spall, ch. 2)
  step: s' := s_k + d_k, with d_k sampled from a multivariate probability distribution over R^n (e.g., multivariate normal)
  d_k should have mean 0 and the standard deviation of each component should depend on the interval size of the respective solution component
  (see the second sketch after this section)

- enhanced localised random search (algorithm C from Spall, ch. 2; related to the algorithm by Solis and Wets, 1981)
  step: s' := s_k + d_k + b_k
  where d_k is as above and b_k is a bias vector, initialised at 0 and adapted according to search progress

--

[also: nonlinear simplex (Nelder-Mead algorithm); not directly related to the simplex method for linear programming, but uses the idea of a convex hull; underlies fminsearch in MATLAB; see Spall, ch. 2, 2.4]

--
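a minimal sketch (not from Spall) of the two classical methods from section 1, on a simple quadratic g(s) = 0.5 s^T A s - b^T s; the names A, b, g, grad_g, hess_g and the specific numbers are illustrative assumptions:

```python
# sketch: steepest descent with exact line search, and one Newton step,
# on a quadratic objective g(s) = 0.5 s^T A s - b^T s (illustrative only)
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # positive definite -> unique minimiser s* = A^-1 b
b = np.array([1.0, -2.0])

def g(s):      return 0.5 * s @ A @ s - b @ s
def grad_g(s): return A @ s - b     # g'(s)
def hess_g(s): return A             # g''(s); constant for a quadratic

# steepest descent: s_{k+1} = s_k - a_k * g'(s_k), a_k by exact line search
s = np.zeros(2)
for k in range(20):
    d = grad_g(s)
    a = (d @ d) / (d @ A @ d)       # argmin_{a>=0} g(s_k - a * g'(s_k)) for a quadratic
    s = s - a * d
print("steepest descent:", s, g(s))

# Newton-Raphson: s_{k+1} = s_k - g''(s_k)^-1 * g'(s_k); exact in one step here
s0 = np.zeros(2)
s = s0 - np.linalg.solve(hess_g(s0), grad_g(s0))
print("Newton (one step):", s, g(s))
```

for this quadratic the Newton step lands exactly on s^* = A^-1 b, which is the one-step convergence mentioned in the notes above; the line-search formula for a_k only holds for quadratics.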
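a minimal sketch of localised random search (algorithm B from section 2), assuming an isotropic Gaussian step distribution with a fixed scale sigma; the function names, parameters, and test function are illustrative, not taken from Spall:

```python
# sketch: localised random search with Gaussian steps (illustrative assumptions)
import numpy as np

def localised_random_search(g, s0, sigma=0.1, max_iter=10_000, seed=None):
    rng = np.random.default_rng(seed)
    s = np.asarray(s0, dtype=float)
    g_s = g(s)
    for _ in range(max_iter):
        d = rng.normal(0.0, sigma, size=s.shape)  # d_k ~ N(0, sigma^2 I), mean 0
        s_new = s + d                             # s' := s_k + d_k
        g_new = g(s_new)
        if g_new < g_s:                           # accept s' only if it improves g
            s, g_s = s_new, g_new
    return s, g_s

# usage on a simple test function (assumed example, not from the notes)
if __name__ == "__main__":
    sphere = lambda x: float(np.sum(x ** 2))
    best, best_val = localised_random_search(sphere, s0=np.full(5, 3.0), seed=1)
    print(best, best_val)
```

the enhanced variant (algorithm C) would additionally add an adaptive bias vector b_k to each step; in the notes above, how b_k is adapted is left to Spall / Solis and Wets.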
3. some more complex SLS methods:

3.1 simulated annealing (SA) (Spall, ch. 8)
- as in the discrete case, use the Metropolis acceptance criterion
- for the proposal, sample from a distribution over the search space (or neighbourhood), e.g., by adding a multivariate Gaussian step vector (alternative: only change one component at a time, e.g., using a univariate Gaussian)

3.2 evolutionary algorithms (EAs) (Spall, ch. 9+10)
- approach 1: discretisation: encode real numbers into bit vectors of the desired accuracy (e.g., using Gray coding), rest as usual
- approach 2: work directly on continuous variables. example: Evolution Strategy (no recombination):
  - mutation by adding a multivariate Gaussian step vector (Spall, p. 261)
  - produce lambda offspring by mutation from the current population (e.g., randomly chosen parents, but more complicated schemes, including recombination-like ones, are possible)
  - elitist selection from the resulting old + lambda new individuals
  -> known as (N+lambda)-ES, where N = population size (see the sketch at the end of these notes)

--

4. alternative approach: use continuous optimisation as a subroutine in a hybrid SLS method (which can otherwise use, e.g., discrete steps)
-> Schaerf & Di Gaspero: application to (financial) portfolio optimisation; use a quadratic program solver as a subroutine

---
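a minimal sketch of a (N+lambda)-ES without recombination, as in section 3.2: Gaussian mutation of randomly chosen parents, then elitist ("plus") selection; parameter names, the initialisation range, and the test function are illustrative assumptions, not from Spall:

```python
# sketch: (N+lambda)-ES with Gaussian mutation and elitist selection (illustrative)
import numpy as np

def plus_es(g, dim, pop_size=10, n_offspring=20, sigma=0.3,
            n_generations=200, seed=None):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))        # initial population (assumed range)
    fitness = np.array([g(ind) for ind in pop])
    for _ in range(n_generations):
        parents = pop[rng.integers(0, pop_size, size=n_offspring)]   # randomly chosen parents
        offspring = parents + rng.normal(0.0, sigma, size=parents.shape)  # Gaussian mutation
        off_fit = np.array([g(ind) for ind in offspring])
        combined = np.vstack([pop, offspring])                 # old + lambda new individuals
        comb_fit = np.concatenate([fitness, off_fit])
        keep = np.argsort(comb_fit)[:pop_size]                 # elitist selection (minimisation)
        pop, fitness = combined[keep], comb_fit[keep]
    return pop[0], fitness[0]                                  # best individual and its value

# usage on a simple test function (assumed example, not from the notes)
if __name__ == "__main__":
    sphere = lambda x: float(np.sum(x ** 2))
    best, best_val = plus_es(sphere, dim=5, seed=1)
    print(best_val)
```

here sigma is kept fixed; practical ES variants typically adapt the mutation scale during the run, which is beyond what these notes cover.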