Algorithm Configuration Landscapes:
Analysis and Exploitation


Introduction

Automated algorithm configuration finds high-quality parameter settings that can improve algorithm performance by several orders of magnitude. Formally, automated algorithm configuration can be defined as an optimization problem: the goal is to find the parameter setting p of an algorithm A that optimizes its performance, as measured by a metric m, over a set I of problem instances.
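
As a minimal sketch of this objective (the helper names run_algorithm and metric below are hypothetical placeholders, not part of any existing configurator), the quantity being optimized can be written as the average performance of a configuration over the instance set:

    # Hedged sketch of the algorithm configuration objective; not any configurator's actual API.
    # run_algorithm(config, instance) is assumed to run target algorithm A with parameter
    # setting `config` on one problem instance and return its result; `metric` scores that
    # result (e.g., running time, or validation loss for AutoML).
    def configuration_objective(config, instances, run_algorithm, metric):
        scores = [metric(run_algorithm(config, instance)) for instance in instances]
        return sum(scores) / len(scores)  # a configurator seeks the config minimizing this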

Existing algorithm configuration procedures typically rely heavily on powerful meta-heuristics originally designed for complex and challenging search landscapes. The use of these heuristics implicitly assumes that configuration landscapes are also challenging to optimize.

We hypothesize that algorithm configuration landscapes are typically much more benign than previously expected. By way of analogy, suppose your goal is to bake a perfect cake. Consider three of the parameters that you can control while baking your cake: the baking temperature, the baking time and the amount of vanilla flavouring. Intuitively, we expect changes to all of these parameters to yield simple and predictable changes in the quality of your cake:

  1. Uni-modal responses. Quite likely, changing either the temperature or the time simply corresponds to trading off between too little (an under-baked cake) and too much (a burnt cake).
  2. Compensatory interactions. Suppose you have set your baking temperature a little too high. In this case, you should be able to compensate by decreasing the baking time by a small amount.
  3. Negligible interactions. For example, changing the amount of vanilla flavouring is unlikely to have a noticeable impact on the optimal baking time. (A toy sketch of all three properties follows this list.)
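
The following toy Python function (purely illustrative; it does not model any real algorithm or recipe) exhibits all three properties: quality is uni-modal in each parameter, temperature and time compensate for one another, and the vanilla term is additively separable, so it barely interacts with the rest.

    def cake_quality(temperature, time, vanilla):
        # Temperature and time only matter through their product, so a slightly higher
        # temperature can be compensated for by a slightly shorter baking time.
        heat_exposure = temperature * time
        baking_term = -(heat_exposure - 60.0) ** 2    # uni-modal: too little vs. too much heat
        vanilla_term = -(vanilla - 1.0) ** 2          # uni-modal, negligible interaction
        return baking_term + 0.1 * vanilla_term       # higher is better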


Analysis

We performed the first statistically principled analysis of two important types of algorithm configuration landscapes:

  1. Running time minimization landscapes. In this application, the goal is to reduce the running time required to find (optimal) solutions to a given problem instance, thereby reducing the computing resources and power required to solve computational problems. We studied six prominent algorithms for solving NP-hard and NP-complete problems spanning Boolean satisfiability (SAT), mixed integer programming (MIP) and the travelling salesperson problem (TSP).
  2. Automated machine learning (AutoML) loss landscapes. In this application, the goal is to automatically find high-quality settings of a machine learning algorithm's hyper-parameters to maximize a model's predictive accuracy, or, equivalently, to minimize the model's validation loss, thereby making machine learning easier to use for both experts and non-experts alike. We studied eight very well-known, state-of-the-art machine learning methods, including XGBoost, online latent Dirichlet allocation (LDA) and feed-forward neural networks.

In both applications, we found remarkably little evidence suggesting that either of our hypotheses should be rejected. In nearly every case we studied, both the individual parameter responses and the full algorithm configuration landscapes were statistically indistinguishable from uni-modal at the 95% confidence level. There were a small number of exceptions to these trends; however, in each case the deviations from uni-modality were very small (see Pushak & Hoos [2018, 2022a] and Chapter 6.3 of Pushak [2022]).

Similarly, we observed little evidence that interactions between the parameters substantially increase the complexity of algorithm configuration problems. In particular, we used a simplistic configuration procedure that naïvely assumes each parameter can be optimized a single time, in a random order (sketched below). In all cases, we observed that this procedure found final configurations that were, with very high probability, statistically tied with the optimal ones (see Pushak & Hoos [2022a] and Chapter 6.3 of Pushak [2022]). However, despite this simplicity, we still observed that it is not safe to build surrogate functions that assume the parameters do not interact at all. Why? We hypothesize that this is due to the occasional presence of strong, compensatory interactions (see Pushak & Hoos [2022b]).
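
A minimal sketch of this naïve procedure is given below. The arguments default_config, parameter_domains and evaluate are hypothetical placeholders for the target algorithm's default configuration, its parameter domains and a performance estimator; this is an illustration of the idea, not the exact code used in our experiments.

    import random

    def one_pass_random_order_configuration(default_config, parameter_domains, evaluate):
        config = dict(default_config)
        parameters = list(parameter_domains)
        random.shuffle(parameters)  # each parameter is optimized a single time, in a random order
        for parameter in parameters:
            # Keep whichever candidate value performs best with all other parameters held fixed.
            config[parameter] = min(parameter_domains[parameter],
                                    key=lambda value: evaluate({**config, parameter: value}))
        return config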

The biggest difference that we observed between the two types of algorithm configuration landscapes is their stochasticity. In particular, the variance in the running times of the algorithms for solving NP-hard problems tended to be much larger than the variance in the predictive accuracy of machine learning models. This makes sense: the accuracy of a machine learning model is typically bounded between 0 and 1, whereas the running time of an algorithm can vary substantially with the choice of random seed, background environmental noise (e.g., concurrently running processes) and the difficulty of the problem instance being solved.


Exploitation

Golden Parameter Search (GPS) is the first of a new generation of automated algorithm configuration procedures that exploit landscape structure. GPS assumes that all numerical parameters are globally uni-modal and that most parameters do not interact strongly. These assumptions allow GPS to explore an algorithm's parameter configuration space quickly and efficiently. GPS is also equipped with several key ingredients from other automated algorithm configuration procedures (e.g., racing procedures, intensification mechanisms and an improved version of adaptive capping) that allow it to handle high-variance performance measurements, thereby making it particularly well-suited for running time minimization scenarios.
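
GPS takes its name from golden section search, which can locate the optimum of a uni-modal function without evaluating it everywhere. The sketch below shows generic golden section search over a single numerical parameter; it illustrates the underlying idea only and is not GPS's actual implementation (which must also cope with noisy measurements, parameter interactions and capped runs).

    import math

    def golden_section_search(f, low, high, tolerance=1e-3):
        # Minimize a uni-modal function f on [low, high] by repeatedly shrinking the interval.
        inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # ~0.618
        x1, x2 = high - inv_phi * (high - low), low + inv_phi * (high - low)
        f1, f2 = f(x1), f(x2)
        while high - low > tolerance:
            if f1 < f2:               # minimum lies in [low, x2]
                high, x2, f2 = x2, x1, f1
                x1 = high - inv_phi * (high - low)
                f1 = f(x1)
            else:                     # minimum lies in [x1, high]
                low, x1, f1 = x1, x2, f2
                x2 = low + inv_phi * (high - low)
                f2 = f(x2)
        return (low + high) / 2.0

    # Example: a uni-modal response, such as a penalty for baking too briefly or too long.
    # best_time = golden_section_search(lambda t: (t - 42.0) ** 2, 0.0, 120.0)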

GPS is also a highly parallelized method, capable of scaling efficiently from as few as 2 CPU cores up to a theoretically unlimited number. GPS employs multiple granularities of parallelism: it performs a semi-independent search process for each parameter (which allows it to exploit both negligible and compensatory interactions), and it further parallelizes the individual target algorithm runs required to evaluate multiple parameter configurations on multiple problem instances. The more parameters your algorithm has and the longer its target algorithm runs take (i.e., the harder your problem is to solve), the more effectively GPS can make use of a large number of parallel resources.
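
As a hedged illustration of the finer-grained level of parallelism described above (run_target_algorithm is a hypothetical placeholder, not GPS's API), independent target algorithm runs for several (configuration, instance) pairs can simply be farmed out to a pool of workers:

    import concurrent.futures

    def evaluate_runs_in_parallel(runs, run_target_algorithm, max_workers=2):
        # `runs` is a list of (config, instance) pairs; each pair is an independent
        # target algorithm run, so the runs can be executed concurrently.
        with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(run_target_algorithm, config, instance): (config, instance)
                       for config, instance in runs}
            return [(futures[future], future.result())
                    for future in concurrent.futures.as_completed(futures)]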

For running time minimization scenarios, GPS often finds equal or better configurations using smaller configuration budgets than several other long-standing, state-of-the-art algorithm configuration procedures [Pushak & Hoos, 2020].


People


Publications