CPSC 536H: Empirical Algorithmics (Spring 2010)
Notes by Holger H. Hoos, University of British Columbia

---------------------------------------------------------------------------------------
Module 7: Automated Parameter Tuning and Algorithm Configuration
---------------------------------------------------------------------------------------

[incomplete, preliminary version - make sure to download the complete and revised
version from the course web page later!]

---

7.1 Tuning and configuration as optimisation

Many high-performance algorithms have parameters. [ask students: why?]

Important problem: Finding good parameter settings for a given application scenario.
=> optimisation problem!

In this module, consider the following algorithm configuration or parameter tuning
problem (except for Section 7.6):

Given:
- parameterised target algorithm A
- configuration space C (= space of possible parameter settings of A)
- set (or distribution) of problem instances I
- performance metric m

Want:
- configuration c* \in C for which A shows best performance on I
  -> c* \in argmax_{c \in C} m(A(c), I)
     where A(c) is A under configuration c, and m(A, I) is the value of performance
     measure m of A on I
  (a small code sketch of this objective is given at the end of this subsection)

Note:
- m could be run-time, solution quality, error, ...
  and could capture peak performance, robustness, ...
- the real interest is in a configuration c* that achieves best performance not just
  on I, but on instances like those in I (unknown target distribution)
  => machine learning flavour: use I as training data, assess generalisation of
     performance on separate test data (or by cross-validation)
  success depends on how representative I is of the unknown target distribution
  -> risk of "overtuning" to the training set (cf. overfitting in machine learning)
- special case: parameter tuning = local optimisation of relatively few, continuous
  parameters
  (note: in the literature, parameter tuning and algorithm configuration are often
  used synonymously)

Types of parameters:
- continuous (real-valued)
- integer / ordinal
- categorical (including boolean)
=> different optimisation methods required

Note:
- certain parameter combinations may be forbidden
- some parameters may only be active for certain settings of other parameters
  (conditional parameters)

Variations of this problem:
- find good parameter settings for a specific instance to be solved, based on
  properties (= features) of this instance
  -> per-instance algorithm configuration
  (not covered here, but closely related to per-instance algorithm selection
  (-> Module 9))
- modify parameter settings during the run of an algorithm, in response to properties
  of the run, progress indicators
  (-> adaptive tuning, reactive search; Section 7.6, time permitting)
- life-long learning: in an application situation, update instance set I and
  (re-)optimise A on an ongoing basis
  (parallel to actual use or during idle time, e.g., over night)
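To make the formal statement above concrete, here is a minimal Python sketch of
evaluating the objective m(A(c), I) and selecting c* by brute force over a small,
discrete configuration space. Everything here is an illustrative assumption: the
wrapper run_target stands in for running the actual target algorithm, and the
parameter names "noise" and "restarts" are hypothetical.

  import random
  import statistics

  # Toy stand-in for running target algorithm A with configuration c on one
  # instance and measuring performance metric m (here: a run-time-like cost).
  # In a real setup this would launch the solver and parse its output.
  def run_target(config, instance):
      rng = random.Random(instance)   # instance-specific pseudo-randomness
      noise = rng.uniform(0.9, 1.1)
      return (1.0 + (config["noise"] - 0.3) ** 2 + 0.1 * config["restarts"]) * noise

  def m(config, instances):
      """Aggregate performance of A(config) over instance set I (here: mean)."""
      return statistics.mean(run_target(config, inst) for inst in instances)

  # Brute-force realisation of  c* in argmin_{c in C} m(A(c), I)
  # (argmin rather than argmax, since lower cost is better here);
  # only feasible for very small, discrete configuration spaces C.
  def configure(configs, instances):
      return min(configs, key=lambda c: m(c, instances))

  if __name__ == "__main__":
      C = [{"noise": n, "restarts": r} for n in (0.1, 0.3, 0.5) for r in (0, 5)]
      I = [f"instance-{i}" for i in range(20)]
      print("best configuration:", configure(C, I))

In practice, C is far too large and evaluations far too expensive for such a
brute-force loop - which is exactly why the methods in the rest of this module are
needed.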
Main question: How to find optimal (or good) configurations c*? [ask students]

Note:
- this would be easy if parameter effects were independent [ask students: why?]
  however, parameter effects are typically *not* independent
- further challenges:
  - discrete parameters
  - local optima in configuration space
  - many parameters -> high-dimensional configuration spaces
  - expensive evaluation of configurations (need to run the algorithm on multiple
    instances, perhaps multiple times - for randomised algorithms)
-> standard optimisation techniques typically don't work

but: there are relatively simple cases
- parameter tuning (as defined above) can often be done with continuous optimisation
  methods, e.g., the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) by
  Hansen & Ostermeier (2001); can be effective for up to ~10 parameters when
  configurations can be evaluated relatively fast
  (a simple sketch of such a continuous tuning loop follows at the end of this
  section)
- relatively small, discrete (or discretised) configuration spaces and cheap
  evaluations -> basic experimental design procedures (7.2)

Because algorithms with many parameters tend to be difficult to configure, many
designers strive to keep the number of algorithm parameters as small as possible
(Occam's razor in design, KISS = "Keep it simple, stupid")
- we will see later that it can make sense to invert that bias
  (programming by optimisation, see end of course)
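For the simple continuous case mentioned above, a tuning loop can be sketched as
follows. To avoid external dependencies, the sketch uses a (1+1)-style evolution
strategy with Gaussian mutation - a much simplified relative of CMA-ES, not CMA-ES
itself - and the objective evaluate_config is an assumed stand-in for measuring
m(A(c), I) on the training instances; the two real-valued parameters are
hypothetical.

  import random

  def evaluate_config(x):
      """Assumed stand-in for m(A(c), I): mean cost of running the target
      algorithm with continuous parameters x = (x1, x2) on the training set."""
      x1, x2 = x
      return (x1 - 0.7) ** 2 + (x2 - 0.2) ** 2 + random.gauss(0.0, 0.01)

  def one_plus_one_es(x0, sigma=0.2, evaluations=200, seed=0):
      """(1+1)-ES sketch: keep a single incumbent, mutate it with Gaussian
      noise, accept the mutant if it evaluates at least as well, and adapt the
      step size sigma depending on success (1/5th-success-style rule)."""
      rng = random.Random(seed)
      x, fx = list(x0), evaluate_config(x0)
      for _ in range(evaluations):
          y = [xi + rng.gauss(0.0, sigma) for xi in x]
          fy = evaluate_config(y)
          if fy <= fx:
              x, fx, sigma = y, fy, sigma * 1.1   # success: accept, widen search
          else:
              sigma *= 0.98                       # failure: narrow search
      return x, fx

  if __name__ == "__main__":
      best, cost = one_plus_one_es([0.5, 0.5])
      print("tuned parameters:", best, "estimated cost:", cost)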
---

7.2 Basic experimental design procedures

experimental design (aka design of experiments, DoE):
area of statistics concerned with the design of information-gathering processes
(experiments) in which variation is present, whether under the full control of the
experimenter or not

here: selection of algorithm configurations at which to evaluate a target algorithm,
in order to achieve specific goals, such as
- modelling the parameter response (= performance as a function of parameter settings)
- finding optimal (or very good) configurations
  (note: typically easier when we have a good model [ask students: why?])

slightly more general: consider arbitrary inputs, not just algorithm parameters
[ask students: which other inputs might we be interested in, in empirical
algorithmics?]

standard terminology:
- experimental region: set of (combinations of) input values for which we wish to
  study or model the response
  here: configuration space
- point in experimental region = specific set of input values
  here: configuration
- experimental design: set of points in the experimental region for which we compute
  the response
  here: set of configurations on which to evaluate the target algorithm
- factor: input variable (that may affect the response)
  here: parameter
- controlled factor: input controlled by the experimenter
  (as is typically the case for algorithm parameters)
- uncontrolled factor: input not controlled by the experimenter
  (in our case, e.g., seed of the PRNG, instance properties)
- factor level: value of an input variable considered in the design
  here: parameter value

Note: in physical experiments (or any other scenario with uncontrolled factors),
experimental design is complicated by
- noise (non-systematic effect of uncontrolled factors)
- bias (systematic effect of uncontrolled factors)
=> Classical experimental design uses
- replication and blocking to control for noise
- randomisation to control for bias

Additional complications can arise from:
- correlated inputs (e.g., collinearity)
  here: parameter dependencies
- incorrect assumptions in the statistical model of the relation between inputs and
  response (model bias)

Experimental design methods are used to address these problems:
- orthogonal design: use of uncorrelated input values makes it possible to
  independently assess the effects of individual inputs on the response
  (see also factorial designs)
- designs for model bias + use of diagnostics (e.g., scatter plots, quantile plots)
  can protect against certain types of bias

Principle in experimental design, motivated by the goal to model the response:
Designs should provide information about all portions of the experimental region
(if there is no prior knowledge / assumptions about the true relationship between
inputs and response).
=> use space-filling designs, i.e., designs that spread points evenly throughout the
   experimental region

Also:
- predictors for the response are often based on interpolators
- the prediction error at any point is relative to its distance from the closest
  design point
- uneven designs can yield predictors that are very inaccurate in sparsely observed
  parts of the experimental region

Simple designs often discretise continuous factors using a regular grid over the
experimental region

Note: in all of the following, efficient parallelisation is possible
(parallel evaluation of a set of configurations, of a single configuration on a set
of instances)

Simple design methods:
- full factorial design: consider all combinations of factor levels
  -> complete search of the configuration space
  [typically impractical - ask students: why?]
- simple (uniform) random sampling
  problem: for small samples in high-dimensional regions
  -> clustering, poorly covered areas
- stratified random sampling: divide the region into n strata (spread evenly),
  randomly select one point from each stratum

Even better: Latin Hypercube Designs

-- Latin Hypercube Designs (LHDs)

Motivation: want good coverage of a large region with relatively few points

How to construct an LHD with n points for two continuous parameters:
1. partition the experimental region into a square with n^2 cells
   (n along each dimension)
2. label the cells with integers from {1, ..., n} such that a Latin square is
   obtained (in a Latin square, each integer occurs exactly once in each row and
   each column)
3. select one of the integers, say i, at random
4. sample one point from each cell labelled with i
[illustrate on board]

General procedure for constructing an LHD of size n given d continuous, independent
inputs:
1. divide the domain of each input into n intervals
2. construct an n x d matrix M whose columns are different, randomly selected
   permutations of {1, ..., n}
3. each row of M corresponds to a cell in the hyper-rectangle induced by the
   interval partitioning from Step 1; sample one point from each of these cells
   (for deterministic inputs: centre of each cell)
(see the code sketch below)
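A small sketch of two of the designs described above: a full factorial design over
given factor levels, and the general LHD construction for d continuous inputs, each
scaled to [0, 1] (assumptions: independent inputs, unit-interval domains; function
names are illustrative).

  import itertools
  import random

  def full_factorial(levels):
      """Full factorial design: all combinations of the given factor levels.
      levels: dict mapping factor name -> list of levels."""
      names = list(levels)
      return [dict(zip(names, combo))
              for combo in itertools.product(*(levels[n] for n in names))]

  def latin_hypercube(n, d, seed=0):
      """LHD of size n for d continuous inputs on [0, 1]^d:
      1. divide each input's domain into n intervals,
      2. build an n x d matrix whose columns are random permutations of 0..n-1,
      3. sample one point uniformly from the cell given by each row."""
      rng = random.Random(seed)
      columns = []
      for _ in range(d):
          perm = list(range(n))
          rng.shuffle(perm)
          columns.append(perm)
      design = []
      for row in range(n):
          point = [(columns[j][row] + rng.random()) / n for j in range(d)]
          design.append(point)
      return design

  if __name__ == "__main__":
      print(full_factorial({"noise": [0.1, 0.3, 0.5], "restarts": [0, 5]}))
      for p in latin_hypercube(n=5, d=2):
          print([round(x, 3) for x in p])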
Note: LHDs need not be space-filling (i.e., they may cover the region unevenly)

Potential remedies:
- randomised orthogonal array designs: ensure that u-dimensional projections of
  points (for u = 1, ..., t) are regular grids;
  drawback: exist only for certain values of n and t
- cascading LHDs: construct secondary LHDs for small regions around the points of a
  primary LHD
- generate & test: construct many LHDs and use additional criteria to select a
  'good' LHD
  (the general generate & test approach can also be applied to designs obtained from
  simple or stratified random sampling)

Other advanced methods (not covered in detail here, but see, e.g., Santner et al., 2003):
- distance-based designs (ensure spread of points)
- uniform designs (ensure uniformity of coverage within the experimental region)
- designs satisfying multiple criteria
-> difficult to construct, often use heuristic search / optimisation methods
also: hardly used for algorithm configuration

Further reading:
- Chapter 5 of T.J. Santner et al.: The Design and Analysis of Computer Experiments,
  Springer, 2003.

---

7.3 Racing approaches

key idea behind racing [ask students]:
- sequentially evaluate a given set of algorithms/configurations
- in each iteration, perform one new run per candidate
- eliminate poorly performing candidates from the race as soon as sufficient evidence
  is gathered against them

-- Hoeffding races (Maron & Moore, 1994)

[named after Wassily Hoeffding, Finnish-born American statistician known as one of
the founding fathers of nonparametric statistics; 1914-1991]

application context:
- original paper: selecting a good model (one with minimal error) for classification,
  function approximation
- but: directly applies to a much wider class of algorithms, including ones obtained
  from a parameterised algorithm + a set of configurations

idea:
- given a set S of blackbox (learning) algorithms ("learning box") and a "test set" T
  of size N
- with each algorithm, associate the number of test instances evaluated and an
  estimate of its error rate (= performance)
- start the race with all algorithms from S
- in each iteration of the Hoeffding race procedure,
  - randomly select an instance t from T
  - run each algorithm still in the race (set S') on t
  - update the error estimates for all algorithms in S'
- terminate after a fixed number of iterations or when only one algorithm is left
[draw illustration - see also paper, Fig. 2]

some details:
- error estimates are obtained from Hoeffding's formula, which bounds the deviation
  of the true from the estimated mean error for a given confidence 1-delta and a
  bound B on the error on individual instances [see p.3 of paper]
- the confidence 1-delta is chosen by the user
  (e.g., 1-0.01/(n*m), where m = |S| = number of algorithms, n = number of test
  instances)
- the bound B is given by the range of observed error values for a given algorithm

note:
- provably correct with given (user-defined) confidence
  (means that the probability of returning a set of algorithms that does not include
  the best one is upper-bounded)
  (proof: not hard, see paper)

note:
- the paper also mentions hillclimbing (for continuous parameters) - we'll discuss
  this later

(a minimal sketch of the elimination rule follows below)
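A minimal sketch of the Hoeffding-race elimination rule, under a standard form of
Hoeffding's bound: with confidence 1-delta, the true mean error lies within
eps = B * sqrt(ln(2/delta) / (2n)) of the empirical mean after n observations, and a
candidate is dropped once its lower confidence bound exceeds the best candidate's
upper bound. The observed error values are assumed inputs (produced by running each
candidate on the test instances); the exact constants used in the original paper may
differ slightly.

  import math
  import statistics

  def hoeffding_eps(n, B, delta):
      """Half-width of the confidence interval around the empirical mean error
      after n observations, given error range B and confidence 1 - delta."""
      return B * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

  def eliminate(errors, B, delta):
      """errors: dict candidate -> list of observed errors (one per instance run).
      Returns the candidates that survive this elimination step."""
      bounds = {}
      for cand, errs in errors.items():
          mean = statistics.mean(errs)
          eps = hoeffding_eps(len(errs), B, delta)
          bounds[cand] = (mean - eps, mean + eps)      # (lower, upper) bound
      best_upper = min(upper for _, upper in bounds.values())
      # drop every candidate whose error is provably worse than the best upper bound
      return [cand for cand, (lower, _) in bounds.items() if lower <= best_upper]

  if __name__ == "__main__":
      observed = {"A1": [0.10, 0.12, 0.11, 0.09],
                  "A2": [0.30, 0.28, 0.31, 0.29],
                  "A3": [0.12, 0.14, 0.10, 0.13]}
      # with only four runs the intervals are wide; eliminations typically
      # require many more runs per candidate
      print(eliminate(observed, B=1.0, delta=0.01))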
-- F-Race (Birattari et al., 2002)

[named after Milton Friedman, American economist, statistician and Nobel laureate;
1912-2006]

inspired by Hoeffding races

key idea:
- sequentially evaluate algorithms/configurations; in each iteration, perform one new
  run per algorithm/configuration
- eliminate poorly performing algorithms/configurations as soon as sufficient
  evidence is gathered against them
- use the Friedman test (aka Friedman two-way analysis of variance by ranks) to
  detect poorly performing algorithms/configurations

details:
- the Friedman test assesses whether m configurations show no significant performance
  differences on the n given instances (null hypothesis)
- if the null hypothesis is rejected, i.e., some configurations perform better than
  others, perform a series of pairwise post hoc tests between the incumbent and all
  other configurations
- drop all configurations from the race for which the pairwise tests indicate
  significantly worse performance than the incumbent
  (a minimal sketch of this elimination step appears at the end of this section)

note:
- the Friedman test is non-parametric, based on ranking the configurations on each
  instance
- ranking separately on each instance amounts to blocking
  (a well-known variance reduction technique)
- no proof of correctness (with bounded error) is given, but would be possible
  (it would require a multiple testing correction)

-- Iterated F-Race (Balaprakash et al., 2007)

key limitation of F-Race: the initial stages of the procedure need to evaluate all
given configurations -> infeasible for large configuration spaces

solutions [ask students]:
- random sampling of configurations, followed by F-Race -> RSD/F-Race
- interleaved sampling of configurations, selection by F-Race, where sampling is
  focussed on promising configurations -> Iterated F-Race = I/F-Race

Some details on I/F-Race:
- three stages per iteration:
  1. sample a fixed number of configurations based on a (simplistic) probabilistic
     model
  2. perform a standard F-Race on the configurations from 1
     (disjunctive termination criterion: min # configurations, max time,
      min # instances)
  3. update the probabilistic model based on the winning configurations from 2
- first iteration: sample using independent uniform distributions
  (as in RSD/F-Race)
- probabilistic model = one k-variate normal distribution (k = number of parameters)
  per configuration from stage 2; covariance 0 => k independent univariate normal
  distributions
  - means = parameter values of the configuration from which the distribution is
    obtained
  - volume reduction: gradual, geometric decrease of the variances
    => samples in later iterations are increasingly focussed near configurations
       already shown to be good (intensification)
- sampling (stage 1) after the first iteration: for each configuration to be sampled,
  follow a 2-stage process:
  1. choose one of the probabilistic models with probability depending on its overall
     performance rank
  2. sample one configuration from that model
- input to stage 2 = configurations from stage 2 of the previous iteration + newly
  sampled configurations from stage 1; total number = N_max (parameter of I/F-Race)

--

note: RSD/F-Race and I/F-Race can deal with continuous parameters

--

Successful applications of F-Race and variants:
- MAX-MIN Ant System for the TSP (6 parameters)
- simulated annealing for vehicle routing with stochastic demands (4 parameters)
- estimation-based local search for the PTSP = probabilistic TSP (3 parameters)
(resulted in new state-of-the-art algorithms for the last two of these)
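A minimal sketch of the F-Race elimination step, assuming SciPy is available: it uses
scipy.stats.friedmanchisquare for the Friedman test and, as a stand-in for the
rank-based post hoc tests of the original procedure, pairwise one-sided Wilcoxon
signed-rank tests between the incumbent and each other configuration. The results
matrix is an assumed input (one row per instance, one column per configuration,
lower = better).

  import numpy as np
  from scipy import stats

  def frace_eliminate(results, names, alpha=0.05):
      """One elimination step of an F-Race-style procedure.
      results: array of shape (n_instances, n_configs), e.g., run-times.
      Returns the names of the configurations that stay in the race."""
      n_configs = results.shape[1]
      if n_configs < 3:
          return list(names)                  # Friedman test needs >= 3 groups
      # Friedman test: any significant differences among the configurations?
      _, p = stats.friedmanchisquare(*(results[:, j] for j in range(n_configs)))
      if p >= alpha:
          return list(names)                  # no evidence yet - keep everyone
      # incumbent = configuration with the best (lowest) mean rank over instances
      ranks = np.argsort(np.argsort(results, axis=1), axis=1) + 1
      incumbent = int(np.argmin(ranks.mean(axis=0)))
      survivors = []
      for j in range(n_configs):
          if j == incumbent:
              survivors.append(names[j])
              continue
          # post hoc stand-in: is configuration j significantly worse than the
          # incumbent? (one-sided Wilcoxon signed-rank test)
          _, p_j = stats.wilcoxon(results[:, j], results[:, incumbent],
                                  alternative="greater")
          if p_j >= alpha:
              survivors.append(names[j])
      return survivors

  if __name__ == "__main__":
      rng = np.random.default_rng(0)
      base = rng.uniform(1.0, 2.0, size=(20, 1))       # per-instance difficulty
      runtimes = np.hstack([base * f for f in (1.0, 1.05, 1.8, 1.1)])
      runtimes += rng.uniform(0.0, 0.05, size=runtimes.shape)
      print(frace_eliminate(runtimes, ["c1", "c2", "c3", "c4"]))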
--

Further details:
- Hoeffding Races: Accelerating Model Selection Search for Classification and
  Function Approximation. O. Maron and A.W. Moore. In: Advances in Neural Information
  Processing Systems 6, pp. 59-66, 1994.
- A Racing Algorithm for Configuring Metaheuristics. M. Birattari, T. Stützle,
  L. Paquete and K. Varrentrapp. In: Proceedings of the Genetic and Evolutionary
  Computation Conference (GECCO 2002), pp. 11-18, 2002.

---

7.4 Model-free search and genetic programming

-- BasicILS
[presentation by Shervin; see Hutter et al. 2007, 2009]

-- FocusedILS
[presentation by Shervin; see Hutter et al. 2007, 2009]

-- Genetic programming / CLASS (only covered very briefly)

general idea:
- use stochastic local search in a space of algorithms (programs, automata, grammars)
  to find one that achieves a certain goal,
  here: optimised performance on given (training) instances
- here: focus on evolutionary algorithms - an important and widely used class of SLS
  algorithms inspired by (typically simple) models of evolution

evolutionary algorithms in a nutshell:
- use populations of candidate solutions (here: algorithms)
- iteratively apply genetic operators: mutation, recombination, selection
  [illustrate briefly on the board; for details see SLS:FA, Ch. 2]
- fitness = performance measure to be optimised

genetic programming typically uses tree-based representations of the object being
constructed
[briefly discuss mutation, recombination on trees]

examples:
- tree-based representation of an algorithm for the "snake game"
  -> http://www.gamedev.net/reference/articles/article1175.asp
- tree-based representation of a heuristic function (used inside an algorithm,
  e.g., for SAT)
  -> Fukunaga (2004)

some other successful applications of genetic programming:
quantum computing, electronic design, game playing, sorting, searching
(see http://www.genetic-programming.com/humancompetitive.html;
some of the evolved algorithms are patented -> automated invention)

note: GP is widely seen as a machine learning / AI technique, but is unfortunately
not very widely used or regarded by these communities. Still, it is a widely applied
paradigm (including in industry); the wider area of evolutionary algorithms /
evolutionary computation has its own conferences, journals, ...

--

Further details:
- Automatic Algorithm Configuration based on Local Search. F. Hutter, H.H. Hoos and
  T. Stützle. In: Proceedings of the 22nd Conference on Artificial Intelligence
  (AAAI-07), pp. 1152-1157, 2007.
- ParamILS: An Automatic Algorithm Configuration Framework. F. Hutter, H.H. Hoos,
  K. Leyton-Brown and T. Stützle. Journal of Artificial Intelligence Research 36,
  pp. 267-306, 2009.
- Evolving Local Search Heuristics for SAT Using Genetic Programming. A.S. Fukunaga.
  In: Genetic and Evolutionary Computation - GECCO 2004, Part II, volume 3103 of
  Lecture Notes in Computer Science, pp. 483-494, Springer Verlag, 2004.

---

7.5 Sequential experimental design procedures (SPO, SKO and related approaches)

[guest lecture by Frank Hutter - slides will be made available, please take your own
notes]

---

7.6 Adaptive tuning and reactive search (aka dynamic algorithm control)

[time permitting - only covered very briefly this time]

key idea: modify the algorithm configuration (parameters) while solving a given
instance, typically based on dynamic analysis of progress / intermediate or candidate
solutions

note: not blackbox - more problem- and algorithm-specific than the previous
approaches

two fundamental approaches:
1. exploit problem- or algorithm-specific knowledge to build a dynamic control scheme
2. use off-the-shelf machine learning techniques (in particular, reinforcement
   learning) in combination with problem-/algorithm-specific features
(middle ground exists)
(a minimal sketch of a simple dynamic control scheme follows below)
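As an illustration of approach 1, a minimal sketch of a dynamic control scheme in the
spirit of adaptive noise mechanisms for stochastic local search: the noise parameter
is increased when the search appears to stagnate and decreased after improvements.
The search loop, thresholds and step sizes are illustrative assumptions, not the
mechanism from the literature cited below.

  import random

  def adaptive_noise_search(evaluate, neighbours, start, steps=1000, seed=0):
      """Toy SLS loop with dynamic control of a single 'noise' parameter
      (probability of taking a random, possibly worsening step).
      If no improvement has been seen for a while, increase the noise
      (diversify); right after an improvement, decrease it (intensify)."""
      rng = random.Random(seed)
      current, best = start, start
      noise, since_improvement = 0.2, 0
      for _ in range(steps):
          nbrs = neighbours(current)
          if rng.random() < noise:
              candidate = rng.choice(nbrs)                 # random walk step
          else:
              candidate = min(nbrs, key=evaluate)          # greedy step
          current = candidate
          if evaluate(current) < evaluate(best):
              best, since_improvement = current, 0
              noise = max(0.05, noise * 0.8)               # intensify
          else:
              since_improvement += 1
              if since_improvement > 20:                   # stagnation detected
                  noise = min(0.8, noise + 0.05)           # diversify
                  since_improvement = 0
      return best

  if __name__ == "__main__":
      # toy problem: minimise a bumpy 1-D function over the integers
      f = lambda x: (x - 42) ** 2 + 10 * (x % 7)
      nbrs = lambda x: [x - 3, x - 1, x + 1, x + 3]
      print(adaptive_noise_search(f, nbrs, start=0))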
note: the mechanisms used for modifying parameters during execution can and should
themselves be configured using the techniques discussed earlier in this module
(this includes the selection of mechanisms / components as well as the setting of
static parameters - i.e., ones not changing during execution of the target
algorithm - that control their behaviour)

closely related, but not covered here:
- instance-based algorithm selection (see Module 9)
- dynamic algorithm selection (see Module 9 for pointers)

Some literature:
- R. Battiti and M. Protasi. Reactive Search, a History-Based Heuristic for MAX-SAT.
  ACM Journal of Experimental Algorithmics 2, 1997.
- R. Battiti, M. Brunato and F. Mascia. Reactive Search and Intelligent Optimization,
  volume 45 of Operations Research/Computer Science Interfaces. Springer Verlag,
  2008.
- Holger H. Hoos. An Adaptive Noise Mechanism for WalkSAT. Proceedings of the 18th
  National Conference on Artificial Intelligence (AAAI-02), pp. 655-660,
  AAAI Press / The MIT Press, 2002.
(and references therein)

---

learning goals:
- be able to formally state and explain the algorithm configuration problem for a
  given set or distribution of instances
- be able to explain why standard optimisation techniques are often not applicable
  for practically solving algorithm configuration problems
- be able to list and explain the various types of parameters and to explain their
  impact on the choice of methods to be used for algorithm configuration problems
- be able to explain basic experimental design procedures, particularly including
  Latin hypercube designs, and their use and limitations for algorithm configuration
- be able to explain how racing procedures, including Hoeffding races, F-Race and
  Iterated F-Race, solve algorithm configuration problems (including algorithmic
  details)
- be able to explain how model-free methods, in particular ParamILS and genetic
  programming, solve algorithm configuration problems
- be able to explain sequential experimental design procedures for algorithm
  configuration, and to contrast them with model-free methods and basic experimental
  design methods
- be able to explain the concept of adaptive tuning / reactive search, and to explain
  at least two approaches to implementing it at the conceptual level