CPSC 536H: Empirical Algorithmics (Spring 2008)
Notes by Holger H. Hoos, University of British Columbia

---------------------------------------------------------------------------------------------
Module 2: Deterministic algorithms for decision problems
---------------------------------------------------------------------------------------------

2.1 Decision problems

Given: Input data (e.g., a graph G and an integer k)
Objective: Output "yes" or "no" answer (e.g., to the question "is there a clique of size k in G?")

Other examples:
- primality: given an integer n, is n a prime number?
- SAT: given a propositional formula F, is there an assignment a of truth values to the variables in F such that F is true under a?
- scheduling: given a set of resources R, a set of tasks T with resource requirements, and a time t, can all tasks in T be accomplished within time t?

Note: Formally, decision problems can be represented by the set of all "yes" instances, i.e., the set of input data for which the answer is "yes" (this set may be hard or impossible to compute).

---

2.2 Deterministic decision algorithms

A decision algorithm is called deterministic if its behaviour is completely determined by the given input data.

Here:
- Consider only error-free decision algorithms, i.e., algorithms that never give an incorrect answer.
- Consider only algorithms that terminate on every given problem instance.
- Focus on measuring the performance of the algorithm (typically the run-time required for solving a given problem instance, but can also be the consumption of other resources, in particular memory).

Note:
- Input data can comprise the specification of the problem instance as well as parameter settings.
- The performance of a given deterministic algorithm A can depend significantly on its implementation and the environment in which it is executed.

Execution environment = machine M + operating system S + everything that happens on M other than the execution of A

Note:
- The execution environment can be difficult to control, which may cause noise in performance measurements (e.g., due to other processes in multi-tasking operating systems, cache effects, network activity).

How to specify an execution environment? [ask students]
- machine parameters: processor make + model + speed, size of cache and RAM; sometimes also required: disk size + speed, network type + speed, ...
- operating system make + version, ...

Note:
- Concurrent processes are hard or impossible to specify -> try to control them (by the type of measurement performed - discussed later -, by restricting access to the machine, and by avoiding heavy loads that may cause substantial amounts of page swapping, network traffic, ...).

How to specify an implementation? [ask students]
- report all non-obvious details that impact performance (these may depend substantially on the performance measure, see below)
- make source code (or at least an executable) publicly available (to ensure reproducibility of results)
- report programming language and compiler (+ relevant settings, in particular optimisations)
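A minimal sketch of how such information could be recorded from within R (assuming a Linux machine where /proc/cpuinfo and /proc/meminfo are available; the fields shown are illustrative, not a complete specification):

  env.info <- list(
    machine  = Sys.info()[["nodename"]],
    os       = paste(Sys.info()[["sysname"]], Sys.info()[["release"]]),
    cpu      = grep("model name", readLines("/proc/cpuinfo"), value = TRUE)[1],
    ram      = grep("MemTotal", readLines("/proc/meminfo"), value = TRUE)[1],
    language = R.version.string              # report interpreter/compiler version and settings
  )
  str(env.info)                              # store alongside the experimental results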
How to specify input?
- completely specify the problem instances used + all parameter settings

Note:
- Problem instances can be specified by reference (to the literature or a benchmark collection) or by content (the complete set of instance data).
- Care should be taken when using instance generators: always specify the generator + all of its input (including the random number seed for random instance generators).

Question [ask students]: When using a random instance generator, why is it important to completely specify its input (including the seed) rather than just saying something like 'we used a sample of 100 instances obtained from generator X with parameters P'?
[Answer: extreme variation, outliers]

- Pitfall: on-line resources can easily change or disappear!

General issue: Benchmark selection

Some criteria for constructing/selecting benchmark sets:
- instance hardness (focus on challenging instances)
- instance size (provide a range to facilitate scaling studies)
- instance type (provide variety):
  - individual application instances
  - hand-crafted instances (realistic, artificial)
  - ensembles of instances from random distributions (random instance generators)
  - encodings of various other types of problems (e.g., SAT-encodings of graph colouring problems)

To ensure comparability and reproducibility of results:
- use established benchmark sets from public benchmark libraries (e.g., SATLIB, TSPLIB, CSPLIB, ORLIB, etc.) and/or the related literature;
- make newly created test-sets available to other researchers (via submission to benchmark libraries, publication on the WWW).

Note: Careful selection and a good understanding of benchmark sets are often crucial for the relevance of an empirical study!

---

2.3 Empirical analysis of a single algorithm

For a given instance:
- run the algorithm and measure its performance

Issues:
- choose a performance measure
- control the execution environment (see above)

General issue: Measuring run-time

Note: For now, we will focus on run-time as the primary performance measure.

Wall clock time vs. CPU time [prefer the latter - why?]

To ensure reproducibility and comparability of empirical results, CPU times should be measured in a way that is as independent as possible from machine load.

Note:
- CPU timing has limited resolution - beware of floor effects!

Practical issue: How to report measurements of zero CPU time? [as '< res']

[Briefly show and discuss timer code for Linux]

To achieve better abstraction from the implementation and run-time environment, it is often preferable to measure run-time using
- operation counts that reflect the number of operations that contribute significantly towards an algorithm's performance, and
- cost models that specify the CPU time for each such operation for given input data, implementation and execution environment.

Note:
- Cost models typically depend significantly on the instance data and can depend on algorithm parameters.

Pitfalls:
- counting operations whose costs vary for fixed input data, implementation + execution environment and using these as surrogates for CPU time
- using cost models that do not reflect important (and costly) operations

Additional benefits of operation counts:
- can achieve better resolution than direct CPU timing

Question: How can we determine the CPU time per operation in that case? [ask students]
[Answer: measure the time for a sequence of operations - see the sketch below]
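A minimal sketch of this idea in R (the operation and repetition count below are hypothetical placeholders): time a long sequence of the operation of interest and divide the measured CPU time by the number of operations.

  n <- 10^7                                        # number of repetitions (hypothetical)
  t <- system.time(for (i in 1:n) x <- sqrt(i))    # 'sqrt' stands in for the counted operation
  (t[["user.self"]] + t[["sys.self"]]) / n         # estimated CPU time per operation
  # caveat: this includes loop overhead; in practice, time the algorithm's own inner loop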
For a set of problem instances:
- on each instance, run the algorithm and measure its performance
- analyse the results

Search cost distributions (SCDs):
- distribution of search cost over a set of problem instances
- for instances obtained from a random generator: SCD = empirical probability distribution

General issue: Analysing and summarising distributions

Graphical representations:
- CDFs vs PDFs - CDFs are typically preferable [Why? Ask students. Answer: no binning effects, quantiles can be read off directly, better readability of plots showing multiple distributions]
- use of log plots [examples in gnuplot]
- modes [What do modes mean? Ask students.]
- heavy (=long) vs. fat tails

Def: A random variable X has a heavy tail on the right iff P(X > x) \sim x^(-\alpha) with 0 < \alpha < 2
=> power-law decay of the right tail

fat tail = a tail that is fatter than in a Gaussian (= normal distribution) [as measured by kurtosis]

[What do fat / heavy tails mean? Ask students.]

Notes: heavy-tailed distributions
- have infinite variance and can have infinite mean (for \alpha <= 1) -> instability of sample means, ...
- have been used for modelling many phenomena, including network traffic and the behaviour of search algorithms for NP-hard problems
- are closely related to self-similar phenomena, self-similar structures
- heavy left tails??
[see also: http://en.wikipedia.org/wiki/The_long_tail, http://en.wikipedia.org/wiki/Long-range_dependency]

- outliers - def: x is an outlier if it lies more than 1.5 times the inter-quartile range (IQR) from the closest quartile, i.e., min{|x - q_0.25|, |x - q_0.75|} >= 1.5*(q_0.75 - q_0.25)

Descriptive statistics (summarise a distribution):

- location: mean, median, quantiles
  median for an even number of samples = average of the two middle values (or, alternatively, the upper of the two)

  Def: p-quantile q_p of a random variable X = value x s.t. P(X <= x) >= p *and* P(X >= x) >= 1-p
  => q_0 = min, q_1 = max
  estimates for sample quantiles: various algorithms (rounding, interpolation) - not much of an issue for large samples
  frequently used quantiles: q_0.5 = median; q_0.25, q_0.75 = quartiles
  quantiles are often preferable over means [Why? Ask students. Answer: statistical stability]

- spread: variance / stddev, quantile ranges, quantile ratios
  quantile ranges or ratios are often preferable over variance / stddev

- higher moments:
  - (sample) skewness (3rd moment) = measure of asymmetry
    = sqrt(n) * sum_{i=1..n} (x_i - mean(x))^3 / [sum_{i=1..n} (x_i - mean(x))^2]^(3/2)
  - (sample) kurtosis (4th moment) = measure of 'peakedness' (also reflects 'fatness' of the tails)

Box plot (Tukey, 1977):
- box = q_0.25, q_0.75 (quartiles), line = median, whiskers = most extreme data points within 1.5*IQR of the quartiles, points = outliers (all of them)
[Draw illustration]
[see also http://web2.concordia.ca/Quality/tools/4boxplots.pdf]

Note:
- Sometimes, extreme outliers are distinguished from mild ones, where extreme outliers are more than 3*IQR from the closest quartile.
- Variations of the concept exist, e.g., in the definition of the whiskers, additional indication of the mean, ...

Example: Run-time data from a study of a heuristic MAX-CLIQUE algorithm from Franco Mascia (Univ Trento) [Show plot]

Note: SCDs on hard combinatorial problems are often not normally distributed and often have very high variance, long tail(s)
-> be careful when using statistical tests!!

Characterisation by means of known, parametric distributions (function fitting) [discuss only briefly here]
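To illustrate the above in R (with hypothetical run-time data; in practice the measured CPU times would be read from a file):

  rt <- c(0.12, 0.35, 0.41, 0.77, 1.2, 2.3, 4.1, 9.8, 35, 120)   # hypothetical run-times [CPU sec]
  plot(sort(rt), (1:length(rt)) / length(rt), type = "s", log = "x",
       xlab = "run-time [CPU sec]", ylab = "P(RT <= x)")          # empirical CDF, log-scaled x-axis
  quantile(rt, probs = c(0.25, 0.5, 0.75))                        # quartiles and median
  IQR(rt)                                                         # inter-quartile range
  boxplot(rt, log = "y")                                          # box plot on a log-scaled axis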
When to summarise results on benchmark sets? [Ask students.]
- when instances come from a distribution (e.g., a random instance generator or another stochastic process)
- when dealing with a large number of instances
- caution: when looking at summary statistics only, it is sometimes easy to miss important effects - in particular, when summarising over heterogeneous test sets

---

2.4 Correlation between instance properties and performance

Goal: analyse / characterise the impact of instance properties on the performance of an algorithm

Simple qualitative analysis:
Plot the correlation between a given instance property and performance, one data point per instance (scatter plot).

Simple quantitative analysis:
standard (Pearson) correlation coefficient
- measures linear correlation only
- |r| = 1 <=> perfect linear correlation
- r = 0 <=> no linear correlation
- can use nonlinear transformations (particularly log, loglog) to test for non-linear dependencies

Question: When is an observed correlation statistically significant?
=> use a statistical hypothesis test to assess significance

--

Recap / brief intro: statistical hypothesis tests

null hypothesis (H_0): hypothesis of no change or experimental effect
e.g., two randomised algorithms have the same mean run-time for given input
(default outcome of the test; this is often the opposite of what we really want to show)

alternative hypothesis (H_a): hypothesis of change or experimental effect
e.g., the two algorithms do not have the same mean run-time for the given input

types of error:
type I: erroneous rejection of a true H_0 (prob alpha)
type II: erroneous failure to reject a false H_0 (prob beta)

alpha = prob of type I error = significance level of the test
controlled by the experimenter, depending on the acceptable risk of a type I error
common values: 0.05 and 0.01
confidence level = 1 - alpha

beta = prob of type II error
power = 1 - beta

general testing procedure:
- set alpha
- compute rejection region C of the test statistic from alpha
- compute test statistic T from the data
- if T in C, then reject H_0; else, do not reject H_0

p-value (significance probability): probability of obtaining a value of the test statistic that is at least as extreme as the observed value, assuming H_0 is true
=> checking T in C is equivalent to checking p-value < alpha

NOTE: failure to reject H_0 does not prove that H_0 is true at any confidence level! (type II errors)

--

General issue: testing the significance of correlations

Spearman's rank order test
given: two samples
H_0: there is no monotone association between the two samples (Spearman rank correlation = 0)
statistic: based on the Spearman rank order correlation coefficient = Pearson corr coeff computed on the ranks of the data
note: does not require a normality assumption, is not restricted to linear correlation (unlike tests based on the Pearson corr coeff)
in R: cor.test(corr$V1,corr$V2, method="spear")

[Similar: Kendall's tau test; in R: cor.test(corr$V1,corr$V2, method="kendall")]

Note: These non-parametric tests are preferable to tests based on the Pearson correlation coefficient, which require the assumption that the SCDs are normal distributions.

--

Analyse scaling of performance with instance size:
- measure performance for various instance sizes
- exploratory analysis:
  - use a scaling plot for an initial visual analysis (note: log / loglog plots can be very useful in this context)
  - fit one or more parametric models to the data points (e.g., a * e^(b*x), a * x^b, ...) using a continuous optimisation technique (e.g., the 'fit' function in gnuplot - caveat: local minima, divergence!); see the sketch below

  Practical tip: check all fits visually; fit multiple times with different initial values for the parameters, encouraging the optimiser to approach the best values from below and above.

  - RMSE (= root mean squared error = RMS of the residuals) can be used for an initial assessment of the relative fit of various models.
  [see m2-scaling.pdf]
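As an illustration of such a fit (shown here using R's nls rather than gnuplot's fit, with hypothetical scaling data):

  size <- c(50, 100, 150, 200, 250, 300)              # instance sizes (hypothetical)
  rt   <- c(0.02, 0.09, 0.35, 1.4, 5.6, 22.0)         # median run-times [CPU sec] (hypothetical)
  plot(size, rt, log = "y")                            # exponential scaling -> straight line on a semi-log plot
  cf <- coef(lm(log(rt) ~ size))                       # starting values from a linear fit in log space
  fit.exp <- nls(rt ~ a * exp(b * size),               # exponential model a * e^(b*x)
                 start = list(a = exp(cf[[1]]), b = cf[[2]]))
  sqrt(mean(residuals(fit.exp)^2))                     # RMSE of the residuals (cf. note above)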
- confirmatory analysis:
  - challenge the model by interpolation or extrapolation (i.e., compare predictions obtained from the model against actual data - the latter data must not have been used previously for fitting the model)
  - use bootstrapping for checking the statistical significance of differences/agreement between the predictions and observations from the previous step

bootstrapping for scaling analysis:
given performance measures for m problem instances per size,
for k times: draw performance measures for l
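A generic sketch of such a bootstrap check in R (illustrative only - the data, prediction and number of resamples below are hypothetical, and this is not necessarily the exact protocol intended here): compare a model prediction of the median run-time at a previously unused instance size against a bootstrap percentile confidence interval computed from the run-times observed at that size.

  rt.obs <- c(3.1, 4.7, 5.2, 5.9, 7.4, 8.8, 12.3, 15.0, 21.7, 40.2)  # observed run-times at the test size
  pred   <- 9.5                                      # hypothetical model prediction for the median run-time
  k <- 1000                                          # number of bootstrap resamples
  boot.med <- replicate(k, median(sample(rt.obs, replace = TRUE)))
  ci <- quantile(boot.med, c(0.025, 0.975))          # 95% percentile interval for the median
  pred >= ci[[1]] && pred <= ci[[2]]                 # TRUE: prediction consistent with the observations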