CPSC 536H: Empirical Algorithmics (Spring 2010)
Notes by Holger H. Hoos, University of British Columbia

---------------------------------------------------------------------------------------
Module 7: Automated Parameter Tuning and Algorithm Configuration
---------------------------------------------------------------------------------------

[incomplete, preliminary version - make sure to download the complete and revised
version from the course web page later!]

---

7.1 Tuning and configuration as optimisation

Many high-performance algorithms have parameters. [ask students: why?]

Important problem: Finding good parameter settings for a given application scenario.
=> optimisation problem!

In this module, consider the following algorithm configuration or parameter tuning
problem (except for Section 7.6):

Given:
- parameterised target algorithm A
- configuration space C (= space of possible parameter settings of A)
- set (or distribution) of problem instances I
- performance metric m

Want:
- configuration c* \in C for which A shows best performance on I
  -> c* \in argmax_{c \in C} m(A(c), I)
     where A(c) is A under configuration c, and m(A, I) is the value of performance
     measure m of A on I
  (a small code sketch of this objective is given at the end of this subsection)

Note:
- m could be run-time, solution quality, error, ...
  and could capture peak performance, robustness, ...
- the real interest is in a configuration c* that achieves best performance not just
  on I, but on instances like those in I (unknown target distribution)
  => machine learning flavour: use I as training data, assess generalisation of
     performance on separate test data (or by cross-validation)
  success depends on how representative I is of the unknown target distribution
  -> risk of "overtuning" to the training set (cf. overfitting in machine learning)
- special case: parameter tuning = local optimisation of relatively few, continuous
  parameters
  (note: in the literature, parameter tuning and algorithm configuration are often
  used synonymously)

Types of parameters:
- continuous (real-valued)
- integer / ordinal
- categorical (including boolean)
=> different optimisation methods required

Note:
- certain parameter combinations may be forbidden
- some parameters may only be active for certain settings of other parameters
  (conditional parameters)

Variations of this problem:
- find good parameter settings for a specific instance to be solved, based on
  properties (= features) of this instance
  -> per-instance algorithm configuration
  (not covered here, but closely related to per-instance algorithm selection
  (-> Module 9))
- modify parameter settings during the run of an algorithm, in response to properties
  of the run, progress indicators
  (-> adaptive tuning, reactive search; Section 7.6, time permitting)
- life-long learning: in an application situation, update instance set I and
  (re-)optimise A on an ongoing basis
  (parallel to actual use or during idle time, e.g., over night)
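To make the formal statement above concrete, here is a minimal Python sketch of
evaluating the objective m(A(c), I) and selecting c* by brute force over a small,
discrete configuration space. Everything here is an illustrative assumption: the
wrapper run_target stands in for running the actual target algorithm, and the
parameter names "noise" and "restarts" are hypothetical.

  import random
  import statistics

  # Toy stand-in for running target algorithm A with configuration c on one
  # instance and measuring performance metric m (here: a run-time-like cost).
  # In a real setup this would launch the solver and parse its output.
  def run_target(config, instance):
      rng = random.Random(instance)   # instance-specific pseudo-randomness
      noise = rng.uniform(0.9, 1.1)
      return (1.0 + (config["noise"] - 0.3) ** 2 + 0.1 * config["restarts"]) * noise

  def m(config, instances):
      """Aggregate performance of A(config) over instance set I (here: mean)."""
      return statistics.mean(run_target(config, inst) for inst in instances)

  # Brute-force realisation of  c* in argmin_{c in C} m(A(c), I)
  # (argmin rather than argmax, since lower cost is better here);
  # only feasible for very small, discrete configuration spaces C.
  def configure(configs, instances):
      return min(configs, key=lambda c: m(c, instances))

  if __name__ == "__main__":
      C = [{"noise": n, "restarts": r} for n in (0.1, 0.3, 0.5) for r in (0, 5)]
      I = [f"instance-{i}" for i in range(20)]
      print("best configuration:", configure(C, I))

In practice, C is far too large and evaluations far too expensive for such a
brute-force loop - which is exactly why the methods in the rest of this module are
needed.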
Main question: How to find optimal (or good) configurations c*? [ask students]

Note:
- this would be easy if parameter effects were independent [ask students: why?]
  however, parameter effects are typically *not* independent
- further challenges:
  - discrete parameters
  - local optima in configuration space
  - many parameters -> high-dimensional configuration spaces
  - expensive evaluation of configurations (need to run the algorithm on multiple
    instances, perhaps multiple times - for randomised algorithms)
-> standard optimisation techniques typically don't work

but: there are relatively simple cases
- parameter tuning (as defined above) can often be done with continuous optimisation
  methods, e.g., the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) by
  Hansen & Ostermeier (2001); can be effective for up to ~10 parameters when
  configurations can be evaluated relatively fast
  (a simple sketch of such a continuous tuning loop follows at the end of this
  section)
- relatively small, discrete (or discretised) configuration spaces and cheap
  evaluations -> basic experimental design procedures (7.2)

Because algorithms with many parameters tend to be difficult to configure, many
designers strive to keep the number of algorithm parameters as small as possible
(Occam's razor in design, KISS = "Keep it simple, stupid")
- we will see later that it can make sense to invert that bias
  (programming by optimisation, see end of course)
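For the simple continuous case mentioned above, a tuning loop can be sketched as
follows. To avoid external dependencies, the sketch uses a (1+1)-style evolution
strategy with Gaussian mutation - a much simplified relative of CMA-ES, not CMA-ES
itself - and the objective evaluate_config is an assumed stand-in for measuring
m(A(c), I) on the training instances; the two real-valued parameters are
hypothetical.

  import random

  def evaluate_config(x):
      """Assumed stand-in for m(A(c), I): mean cost of running the target
      algorithm with continuous parameters x = (x1, x2) on the training set."""
      x1, x2 = x
      return (x1 - 0.7) ** 2 + (x2 - 0.2) ** 2 + random.gauss(0.0, 0.01)

  def one_plus_one_es(x0, sigma=0.2, evaluations=200, seed=0):
      """(1+1)-ES sketch: keep a single incumbent, mutate it with Gaussian
      noise, accept the mutant if it evaluates at least as well, and adapt the
      step size sigma depending on success (1/5th-success-style rule)."""
      rng = random.Random(seed)
      x, fx = list(x0), evaluate_config(x0)
      for _ in range(evaluations):
          y = [xi + rng.gauss(0.0, sigma) for xi in x]
          fy = evaluate_config(y)
          if fy <= fx:
              x, fx, sigma = y, fy, sigma * 1.1   # success: accept, widen search
          else:
              sigma *= 0.98                       # failure: narrow search
      return x, fx

  if __name__ == "__main__":
      best, cost = one_plus_one_es([0.5, 0.5])
      print("tuned parameters:", best, "estimated cost:", cost)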
---

7.2 Basic experimental design procedures

experimental design (aka design of experiments, DoE):
area of statistics concerned with the design of information-gathering processes
(experiments) in which variation is present, whether under the full control of the
experimenter or not

here: selection of algorithm configurations at which to evaluate a target algorithm,
in order to achieve specific goals, such as
- modelling the parameter response (= performance as a function of parameter settings)
- finding optimal (or very good) configurations
  (note: typically easier when we have a good model [ask students: why?])

slightly more general: consider arbitrary inputs, not just algorithm parameters
[ask students: which other inputs might we be interested in, in empirical
algorithmics?]

standard terminology:
- experimental region: set of (combinations of) input values for which we wish to
  study or model the response
  here: configuration space
- point in experimental region = specific set of input values
  here: configuration
- experimental design: set of points in the experimental region for which we compute
  the response
  here: set of configurations on which to evaluate the target algorithm
- factor: input variable (that may affect the response)
  here: parameter
- controlled factor: input controlled by the experimenter
  (as is typically the case for algorithm parameters)
- uncontrolled factor: input not controlled by the experimenter
  (in our case, e.g., seed of the PRNG, instance properties)
- factor level: value of an input variable considered in the design
  here: parameter value

Note: in physical experiments (or any other scenario with uncontrolled factors),
experimental design is complicated by
- noise (non-systematic effect of uncontrolled factors)
- bias (systematic effect of uncontrolled factors)
=> Classical experimental design uses
- replication and blocking to control for noise
- randomisation to control for bias

Additional complications can arise from:
- correlated inputs (e.g., collinearity)
  here: parameter dependencies
- incorrect assumptions in the statistical model of the relation between inputs and
  response (model bias)

Experimental design methods are used to address these problems:
- orthogonal design: use of uncorrelated input values makes it possible to
  independently assess the effects of individual inputs on the response
  (see also factorial designs)
- designs for model bias + use of diagnostics (e.g., scatter plots, quantile plots)
  can protect against certain types of bias

Principle in experimental design, motivated by the goal to model the response:
Designs should provide information about all portions of the experimental region
(if there is no prior knowledge / assumptions about the true relationship between
inputs and response).
=> use space-filling designs, i.e., designs that spread points evenly throughout the
   experimental region

Also:
- predictors for the response are often based on interpolators
- the prediction error at any point is relative to its distance from the closest
  design point
- uneven designs can yield predictors that are very inaccurate in sparsely observed
  parts of the experimental region

Simple designs often discretise continuous factors using a regular grid over the
experimental region

Note: in all of the following, efficient parallelisation is possible
(parallel evaluation of a set of configurations, of a single configuration on a set
of instances)

Simple design methods:
- full factorial design: consider all combinations of factor levels
  -> complete search of the configuration space
  [typically impractical - ask students: why?]
- simple (uniform) random sampling
  problem: for small samples in high-dimensional regions
  -> clustering, poorly covered areas
- stratified random sampling: divide the region into n strata (spread evenly),
  randomly select one point from each stratum

Even better: Latin Hypercube Designs

-- Latin Hypercube Designs (LHDs)

Motivation: want good coverage of a large region with relatively few points

How to construct an LHD with n points for two continuous parameters:
1. partition the experimental region into a square with n^2 cells
   (n along each dimension)
2. label the cells with integers from {1, ..., n} such that a Latin square is
   obtained (in a Latin square, each integer occurs exactly once in each row and
   each column)
3. select one of the integers, say i, at random
4. sample one point from each cell labelled with i
[illustrate on board]

General procedure for constructing an LHD of size n given d continuous, independent
inputs:
1. divide the domain of each input into n intervals
2. construct an n x d matrix M whose columns are different, randomly selected
   permutations of {1, ..., n}
3. each row of M corresponds to a cell in the hyper-rectangle induced by the
   interval partitioning from Step 1; sample one point from each of these cells
   (for deterministic inputs: centre of each cell)
(see the code sketch below)
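A small sketch of two of the designs described above: a full factorial design over
given factor levels, and the general LHD construction for d continuous inputs, each
scaled to [0, 1] (assumptions: independent inputs, unit-interval domains; function
names are illustrative).

  import itertools
  import random

  def full_factorial(levels):
      """Full factorial design: all combinations of the given factor levels.
      levels: dict mapping factor name -> list of levels."""
      names = list(levels)
      return [dict(zip(names, combo))
              for combo in itertools.product(*(levels[n] for n in names))]

  def latin_hypercube(n, d, seed=0):
      """LHD of size n for d continuous inputs on [0, 1]^d:
      1. divide each input's domain into n intervals,
      2. build an n x d matrix whose columns are random permutations of 0..n-1,
      3. sample one point uniformly from the cell given by each row."""
      rng = random.Random(seed)
      columns = []
      for _ in range(d):
          perm = list(range(n))
          rng.shuffle(perm)
          columns.append(perm)
      design = []
      for row in range(n):
          point = [(columns[j][row] + rng.random()) / n for j in range(d)]
          design.append(point)
      return design

  if __name__ == "__main__":
      print(full_factorial({"noise": [0.1, 0.3, 0.5], "restarts": [0, 5]}))
      for p in latin_hypercube(n=5, d=2):
          print([round(x, 3) for x in p])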
Note: LHDs need not be space-filling (i.e., they may cover the region unevenly)

Potential remedies:
- randomised orthogonal array designs: ensure that u-dimensional projections of
  points (for u = 1, ..., t) are regular grids;
  drawback: exist only for certain values of n and t
- cascading LHDs: construct secondary LHDs for small regions around the points of a
  primary LHD
- generate & test: construct many LHDs and use additional criteria to select a
  'good' LHD
  (the general generate & test approach can also be applied to designs obtained from
  simple or stratified random sampling)

Other advanced methods (not covered in detail here, but see, e.g., Santner et al., 2003):
- distance-based designs (ensure spread of points)
- uniform designs (ensure uniformity of coverage within the experimental region)
- designs satisfying multiple criteria
-> difficult to construct, often use heuristic search / optimisation methods
also: hardly used for algorithm configuration

Further reading:
- Chapter 5 of T.J. Santner et al.: The Design and Analysis of Computer Experiments,
  Springer, 2003.

---

7.3 Racing approaches

key idea behind racing [ask students]:
- sequentially evaluate a given set of algorithms/configurations
- in each iteration, perform one new run per candidate
- eliminate poorly performing candidates from the race as soon as sufficient evidence
  is gathered against them

-- Hoeffding races (Maron & Moore, 1994)

[named after Wassily Hoeffding, Finnish-born American statistician known as one of
the founding fathers of nonparametric statistics; 1914-1991]

application context:
- original paper: selecting a good model (one with minimal error) for classification,
  function approximation
- but: directly applies to a much wider class of algorithms, including ones obtained
  from a parameterised algorithm + a set of configurations

idea:
- given a set S of blackbox (learning) algorithms ("learning box") and a "test set" T
  of size N
- with each algorithm, associate the number of test instances evaluated and an
  estimate of its error rate (= performance)
- start the race with all algorithms from S
- in each iteration of the Hoeffding race procedure,
  - randomly select an instance t from T
  - run each algorithm still in the race (set S') on t
  - update the error estimates for all algorithms in S'
- terminate after a fixed number of iterations or when only one algorithm is left
[draw illustration - see also paper, Fig. 2]

some details:
- error estimates are obtained from Hoeffding's formula, which bounds the deviation
  of the true from the estimated mean error for a given confidence 1-delta and a
  bound B on the error on individual instances [see p.3 of paper]
- the confidence 1-delta is chosen by the user
  (e.g., 1-0.01/(n*m), where m = |S| = number of algorithms, n = number of test
  instances)
- the bound B is given by the range of observed error values for a given algorithm

note:
- provably correct with given (user-defined) confidence
  (means that the probability of returning a set of algorithms that does not include
  the best one is upper-bounded)
  (proof: not hard, see paper)

note:
- the paper also mentions hillclimbing (for continuous parameters) - we'll discuss
  this later

(a minimal sketch of the elimination rule follows below)
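A minimal sketch of the Hoeffding-race elimination rule, under a standard form of
Hoeffding's bound: with confidence 1-delta, the true mean error lies within
eps = B * sqrt(ln(2/delta) / (2n)) of the empirical mean after n observations, and a
candidate is dropped once its lower confidence bound exceeds the best candidate's
upper bound. The observed error values are assumed inputs (produced by running each
candidate on the test instances); the exact constants used in the original paper may
differ slightly.

  import math
  import statistics

  def hoeffding_eps(n, B, delta):
      """Half-width of the confidence interval around the empirical mean error
      after n observations, given error range B and confidence 1 - delta."""
      return B * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

  def eliminate(errors, B, delta):
      """errors: dict candidate -> list of observed errors (one per instance run).
      Returns the candidates that survive this elimination step."""
      bounds = {}
      for cand, errs in errors.items():
          mean = statistics.mean(errs)
          eps = hoeffding_eps(len(errs), B, delta)
          bounds[cand] = (mean - eps, mean + eps)      # (lower, upper) bound
      best_upper = min(upper for _, upper in bounds.values())
      # drop every candidate whose error is provably worse than the best upper bound
      return [cand for cand, (lower, _) in bounds.items() if lower <= best_upper]

  if __name__ == "__main__":
      observed = {"A1": [0.10, 0.12, 0.11, 0.09],
                  "A2": [0.30, 0.28, 0.31, 0.29],
                  "A3": [0.12, 0.14, 0.10, 0.13]}
      # with only four runs the intervals are wide; eliminations typically
      # require many more runs per candidate
      print(eliminate(observed, B=1.0, delta=0.01))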
-- F-Race (Birattari et al., 2002)

[named after Milton Friedman, American economist, statistician and Nobel laureate;
1912-2006]

inspired by Hoeffding races

key idea:
- sequentially evaluate algorithms/configurations; in each iteration, perform one new
  run per algorithm/configuration
- eliminate poorly performing algorithms/configurations as soon as sufficient
  evidence is gathered against them
- use the Friedman test (aka Friedman two-way analysis of variance by ranks) to
  detect poorly performing algorithms/configurations

details:
- the Friedman test assesses whether m configurations show no significant performance
  differences on the n given instances (null hypothesis)
- if the null hypothesis is rejected, i.e., some configurations perform better than
  others, perform a series of pairwise post hoc tests between the incumbent and all
  other configurations
- drop all configurations from the race for which the pairwise tests indicate
  significantly worse performance than the incumbent
  (a minimal sketch of this elimination step appears at the end of this section)

note:
- the Friedman test is non-parametric, based on ranking the configurations on each
  instance
- ranking separately on each instance amounts to blocking
  (a well-known variance reduction technique)
- no proof of correctness (with bounded error) is given, but would be possible
  (it would require a multiple testing correction)

-- Iterated F-Race (Balaprakash et al., 2007)

key limitation of F-Race: the initial stages of the procedure need to evaluate all
given configurations -> infeasible for large configuration spaces

solutions [ask students]:
- random sampling of configurations, followed by F-Race -> RSD/F-Race
- interleaved sampling of configurations, selection by F-Race, where sampling is
  focussed on promising configurations -> Iterated F-Race = I/F-Race

Some details on I/F-Race:
- three stages per iteration:
  1. sample a fixed number of configurations based on a (simplistic) probabilistic
     model
  2. perform a standard F-Race on the configurations from 1
     (disjunctive termination criterion: min # configurations, max time,
      min # instances)
  3. update the probabilistic model based on the winning configurations from 2
- first iteration: sample using independent uniform distributions
  (as in RSD/F-Race)
- probabilistic model = one k-variate normal distribution (k = number of parameters)
  per configuration from stage 2; covariance 0 => k independent univariate normal
  distributions
  - means = parameter values of the configuration from which the distribution is
    obtained
  - volume reduction: gradual, geometric decrease of the variances
    => samples in later iterations are increasingly focussed near configurations
       already shown to be good (intensification)
- sampling (stage 1) after the first iteration: for each configuration to be sampled,
  follow a 2-stage process:
  1. choose one of the probabilistic models with probability depending on its overall
     performance rank
  2. sample one configuration from that model
- input to stage 2 = configurations from stage 2 of the previous iteration + newly
  sampled configurations from stage 1; total number = N_max (parameter of I/F-Race)

--

note: RSD/F-Race and I/F-Race can deal with continuous parameters

--

Successful applications of F-Race and variants:
- MAX-MIN Ant System for the TSP (6 parameters)
- simulated annealing for vehicle routing with stochastic demands (4 parameters)
- estimation-based local search for the PTSP = probabilistic TSP (3 parameters)
(resulted in new state-of-the-art algorithms for the last two of these)
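A minimal sketch of the F-Race elimination step, assuming SciPy is available: it uses
scipy.stats.friedmanchisquare for the Friedman test and, as a stand-in for the
rank-based post hoc tests of the original procedure, pairwise one-sided Wilcoxon
signed-rank tests between the incumbent and each other configuration. The results
matrix is an assumed input (one row per instance, one column per configuration,
lower = better).

  import numpy as np
  from scipy import stats

  def frace_eliminate(results, names, alpha=0.05):
      """One elimination step of an F-Race-style procedure.
      results: array of shape (n_instances, n_configs), e.g., run-times.
      Returns the names of the configurations that stay in the race."""
      n_configs = results.shape[1]
      if n_configs < 3:
          return list(names)                  # Friedman test needs >= 3 groups
      # Friedman test: any significant differences among the configurations?
      _, p = stats.friedmanchisquare(*(results[:, j] for j in range(n_configs)))
      if p >= alpha:
          return list(names)                  # no evidence yet - keep everyone
      # incumbent = configuration with the best (lowest) mean rank over instances
      ranks = np.argsort(np.argsort(results, axis=1), axis=1) + 1
      incumbent = int(np.argmin(ranks.mean(axis=0)))
      survivors = []
      for j in range(n_configs):
          if j == incumbent:
              survivors.append(names[j])
              continue
          # post hoc stand-in: is configuration j significantly worse than the
          # incumbent? (one-sided Wilcoxon signed-rank test)
          _, p_j = stats.wilcoxon(results[:, j], results[:, incumbent],
                                  alternative="greater")
          if p_j >= alpha:
              survivors.append(names[j])
      return survivors

  if __name__ == "__main__":
      rng = np.random.default_rng(0)
      base = rng.uniform(1.0, 2.0, size=(20, 1))       # per-instance difficulty
      runtimes = np.hstack([base * f for f in (1.0, 1.05, 1.8, 1.1)])
      runtimes += rng.uniform(0.0, 0.05, size=runtimes.shape)
      print(frace_eliminate(runtimes, ["c1", "c2", "c3", "c4"]))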
--

Further details:
- Hoeffding Races: Accelerating Model Selection Search for Classification and
  Function Approximation. O. Maron and A.W. Moore. In: Advances in Neural Information
  Processing Systems 6, pp. 59-66, 1994.
- A Racing Algorithm for Configuring Metaheuristics. M. Birattari, T. Stützle,
  L. Paquete and K. Varrentrapp. In: Proceedings of the Genetic and Evolutionary
  Computation Conference (GECCO 2002), pp. 11-18, 2002.

---

7.4 Model-free search and genetic programming

-- BasicILS
[presentation by Shervin; see Hutter et al. 2007, 2009]

-- FocusedILS
[presentation by Shervin; see Hutter et al. 2007, 2009]

-- Genetic programming / CLASS (only covered very briefly)

general idea:
- use stochastic local search in a space of algorithms (programs, automata, grammars)
  to find one that achieves a certain goal,
  here: optimised performance on given (training) instances
- here: focus on evolutionary algorithms - an important and widely used class of SLS
  algorithms inspired by (typically simple) models of evolution

evolutionary algorithms in a nutshell:
- use populations of candidate solutions (here: algorithms)
- iteratively apply genetic operators: mutation, recombination, selection
  [illustrate briefly on the board; for details see SLS:FA, Ch. 2]
- fitness = performance measure to be optimised

genetic programming typically uses tree-based representations of the object being
constructed
[briefly discuss mutation, recombination on trees]

examples:
- tree-based representation of an algorithm for the "snake game"
  -> http://www.gamedev.net/reference/articles/article1175.asp
- tree-based representation of a heuristic function (used inside an algorithm,
  e.g., for SAT)
  -> Fukunaga (2004)

some other successful applications of genetic programming:
quantum computing, electronic design, game playing, sorting, searching
(see http://www.genetic-programming.com/humancompetitive.html;
some of the evolved algorithms are patented -> automated invention)

note: GP is widely seen as a machine learning / AI technique, but is unfortunately
not very widely used or regarded by these communities. Still, it is a widely applied
paradigm (including in industry); the wider area of evolutionary algorithms /
evolutionary computation has its own conferences, journals, ...

--

Further details:
- Automatic Algorithm Configuration based on Local Search. F. Hutter, H.H. Hoos and
  T. Stützle. In: Proceedings of the 22nd Conference on Artificial Intelligence
  (AAAI-07), pp. 1152-1157, 2007.
- ParamILS: An Automatic Algorithm Configuration Framework. F. Hutter, H.H. Hoos,
  K. Leyton-Brown and T. Stützle. Journal of Artificial Intelligence Research 36,
  pp. 267-306, 2009.
- Evolving Local Search Heuristics for SAT Using Genetic Programming. A.S. Fukunaga.
  In: Genetic and Evolutionary Computation - GECCO 2004, Part II, volume 3103 of
  Lecture Notes in Computer Science, pp. 483-494, Springer Verlag, 2004.

---

7.5 Sequential experimental design procedures (SPO, SKO and related approaches)

[guest lecture by Frank Hutter - slides will be made available, please take your own
notes]

---

7.6 Adaptive tuning and reactive search (aka dynamic algorithm control)

[time permitting - only covered very briefly this time]

key idea: modify the algorithm configuration (parameters) while solving a given
instance, typically based on dynamic analysis of progress / intermediate or candidate
solutions

note: not blackbox - more problem- and algorithm-specific than the previous
approaches

two fundamental approaches:
1. exploit problem- or algorithm-specific knowledge to build a dynamic control scheme
2. use off-the-shelf machine learning techniques (in particular, reinforcement
   learning) in combination with problem-/algorithm-specific features
(middle ground exists)
(a minimal sketch of a simple dynamic control scheme follows below)
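As an illustration of approach 1, a minimal sketch of a dynamic control scheme in the
spirit of adaptive noise mechanisms for stochastic local search: the noise parameter
is increased when the search appears to stagnate and decreased after improvements.
The search loop, thresholds and step sizes are illustrative assumptions, not the
mechanism from the literature cited below.

  import random

  def adaptive_noise_search(evaluate, neighbours, start, steps=1000, seed=0):
      """Toy SLS loop with dynamic control of a single 'noise' parameter
      (probability of taking a random, possibly worsening step).
      If no improvement has been seen for a while, increase the noise
      (diversify); right after an improvement, decrease it (intensify)."""
      rng = random.Random(seed)
      current, best = start, start
      noise, since_improvement = 0.2, 0
      for _ in range(steps):
          nbrs = neighbours(current)
          if rng.random() < noise:
              candidate = rng.choice(nbrs)                 # random walk step
          else:
              candidate = min(nbrs, key=evaluate)          # greedy step
          current = candidate
          if evaluate(current) < evaluate(best):
              best, since_improvement = current, 0
              noise = max(0.05, noise * 0.8)               # intensify
          else:
              since_improvement += 1
              if since_improvement > 20:                   # stagnation detected
                  noise = min(0.8, noise + 0.05)           # diversify
                  since_improvement = 0
      return best

  if __name__ == "__main__":
      # toy problem: minimise a bumpy 1-D function over the integers
      f = lambda x: (x - 42) ** 2 + 10 * (x % 7)
      nbrs = lambda x: [x - 3, x - 1, x + 1, x + 3]
      print(adaptive_noise_search(f, nbrs, start=0))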
note: the mechanisms used for modifying parameters during execution can and should
themselves be configured using the techniques discussed earlier in this module
(this includes the selection of mechanisms / components as well as the setting of
static parameters - i.e., ones not changing during execution of the target
algorithm - that control their behaviour)

closely related, but not covered here:
- instance-based algorithm selection (see Module 9)
- dynamic algorithm selection (see Module 9 for pointers)

Some literature:
- R. Battiti and M. Protasi. Reactive Search, a History-Based Heuristic for MAX-SAT.
  ACM Journal of Experimental Algorithmics 2, 1997.
- R. Battiti, M. Brunato and F. Mascia. Reactive Search and Intelligent Optimization,
  volume 45 of Operations Research/Computer Science Interfaces. Springer Verlag,
  2008.
- Holger H. Hoos. An Adaptive Noise Mechanism for WalkSAT. Proceedings of the 18th
  National Conference on Artificial Intelligence (AAAI-02), pp. 655-660,
  AAAI Press / The MIT Press, 2002.
(and references therein)

---

learning goals:
- be able to formally state and explain the algorithm configuration problem for a
  given set or distribution of instances
- be able to explain why standard optimisation techniques are often not applicable
  for practically solving algorithm configuration problems
- be able to list and explain the various types of parameters and to explain their
  impact on the choice of methods to be used for algorithm configuration problems
- be able to explain basic experimental design procedures, particularly including
  Latin hypercube designs, and their use and limitations for algorithm configuration
- be able to explain how racing procedures, including Hoeffding races, F-Race and
  Iterated F-Race, solve algorithm configuration problems (including algorithmic
  details)
- be able to explain how model-free methods, in particular ParamILS and genetic
  programming, solve algorithm configuration problems
- be able to explain sequential experimental design procedures for algorithm
  configuration, and to contrast them with model-free methods and basic experimental
  design methods
- be able to explain the concept of adaptive tuning / reactive search, and to explain
  at least two approaches to implementing it at the conceptual level