CPSC 536H: Empirical Algorithmics (Spring 2012)
Notes by Holger H. Hoos, University of British Columbia

-------------------------------------------------------
Module 1: General considerations for empirical analysis
-------------------------------------------------------

Purpose: Discuss some issues that come up in all or most empirical analyses of
algorithms (in fact, since design relies on analysis, design tasks also depend
on the same issues)


1.1 How to specify an execution environment?

Execution environment = machine M + operating system S + everything that
happens on M other than the execution of A

Note:
- The execution environment can be difficult to control, which may cause noise
  in performance measurements (e.g., due to other processes in multi-tasking
  operating systems, cache effects, network activity)

How to specify an execution environment? [ask students]
- machine parameters: processor make + model + speed, size of cache and RAM;
  sometimes also required: disk size + speed, network type + speed, ...
- operating system make + version, ...

Note:
- concurrent processes are hard or impossible to specify
  -> try to control (by type of measurement performed - discussed later,
     restricting access to the machine, avoiding heavy loads that may cause
     substantial amounts of page swapping, network traffic, ...)


1.2 How to specify an implementation? [ask students]

- report all non-obvious details that impact performance
  (these may depend substantially on the performance measure, see below)
- make source code (or at least the executable) publicly available
  (to ensure reproducibility of results)
- report programming language and compiler (+ relevant settings, in particular,
  optimisations)


1.3 How to specify input?

- completely specify problem instances used + all parameter settings

Note:
- problem instances can be specified by reference (to literature or a benchmark
  collection) or by content (complete set of instance data)
- care should be taken when using instance generators: always specify the
  generator + all input (including the random number seed for random instance
  generators)

Question [ask students]: When using a random instance generator, why is it
important to completely specify its input (including the seed) rather than just
saying something like 'we used a sample of 100 instances obtained from
generator X with parameters P'?
[Answer: extreme variation, outliers]

- Pitfall: on-line resources can easily change or disappear!


1.4 Benchmark selection

Some criteria for constructing/selecting benchmark sets:
- instance hardness (focus on challenging instances)
- instance size (provide range to facilitate scaling studies)
- instance type (provide variety):
  - individual application instances
  - hand-crafted instances (realistic, artificial)
  - ensembles of instances from random distributions (random instance generators)
  - encodings of various other types of problems
    (e.g., SAT-encodings of graph colouring problems)

To ensure comparability and reproducibility of results:
- use established benchmark sets from public benchmark libraries
  (e.g., SATLIB, TSPLIB, CSPLIB, ORLIB, etc.) and/or related literature;
- make newly created test-sets available to other researchers
  (via submission to benchmark libraries, publication on WWW)

Note: Careful selection and good understanding of benchmark sets are often
crucial for the relevance of an empirical study!


1.5 How to measure performance?

What are typical performance measures we might be interested in? [ask students]

Note: For now, we will focus on run-time as the primary performance measure.

Wall clock time vs. CPU time [prefer the latter - why?]

To ensure reproducibility and comparability of empirical results, CPU times
should be measured in a way that is as independent as possible from machine
load.

Note:
- CPU timing has limited resolution - beware of floor effects!
  Practical issue: How to report measurements of zero CPU time? [as '< res']

[Briefly show and discuss timer code for Linux]
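The Linux timer code shown in class is not reproduced in these notes. As a
minimal sketch of the distinction just discussed - written in R to match the
other code snippets in these notes, with a made-up helper name and workload -
CPU time and wall-clock time can be obtained from proc.time():

  ## minimal sketch (not the Linux timer code from class): CPU vs. wall-clock time
  measure.runtime <- function(f, ...) {    # hypothetical helper, for illustration only
    t0 <- proc.time()
    f(...)                                 # run the computation to be timed
    dt <- proc.time() - t0
    c(cpu  = unname(dt["user.self"] + dt["sys.self"]),  # CPU time used by this process
      wall = unname(dt["elapsed"]))                     # wall-clock time
  }

  ## usage with a hypothetical workload:
  measure.runtime(function(n) sum(sort(runif(n))), 1e6)

On a busy machine, the wall-clock time will typically exceed the CPU time;
reporting CPU time reduces (but does not eliminate) the influence of machine
load. Measurements close to the timer resolution should be reported as '< res',
as noted above.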
To achieve better abstraction from the implementation and run-time environment,
it is often preferable to measure run-time using
- operation counts that reflect the number of operations that contribute
  significantly towards an algorithm's performance, and
- cost models that specify the CPU time for each such operation for given
  input data, implementation and execution environment.

Note:
- cost models typically depend significantly on instance data and can depend
  on algorithm parameters

Pitfalls:
- counting operations whose costs vary for fixed input data, implementation
  + execution environment, and using these as surrogates for CPU time
- using cost models that do not reflect important (and costly) operations

Additional benefits of operation counts:
- can achieve better resolution than direct CPU timing
  Question: How can we determine CPU time / operation in that case? [ask students]
  [Answer: measure the time for a sequence of operations]


1.6 How to report results?

[see Peter Sanders: Presenting Data from Experiments in Algorithmics
 (and related work mentioned there)]


1.7 Exploratory vs confirmatory analysis

Most empirical studies have two components:

- exploratory analyses: collect observations, look at data, try to see trends,
  regularities, patterns
  results: new hypotheses, model modifications, ideas for experiments
  often rather informal, guided by intuition and (some) experience

- confirmatory analyses: carefully design and conduct experiments to answer
  very specific, technical questions
  results: quantitative answers to questions
  typically very technical, using established statistical methods
  (e.g., hypothesis tests)

Note:
- In many cases, researchers repeatedly switch between these stages.
- Tools and techniques exist for both types of analyses, but especially
  exploratory analyses typically require creativity, experience and good
  judgement.


1.8 Statistical hypothesis tests

null hypothesis (H_0): hypothesis of no change or experimental effect
  e.g., two randomised algorithms have the same mean run-time for given input
  (default outcome of the test; this is often the opposite of what we really
  want to show)

alternative hypothesis (H_a): hypothesis of change or experimental effect
  e.g., ... do not have the same mean run-time ...

types of error:
  type I:  erroneous rejection of a true H_0 (prob alpha)
  type II: erroneous failure to reject a false H_0 (prob beta)

alpha = prob of type I error = significance level of the test
  controlled by the experimenter, depending on the acceptable risk of a
  type I error; common values: 0.05 and 0.01
confidence level = 1 - alpha

beta = prob of type II error
power = 1 - beta

general testing procedure:
- set alpha
- compute rejection region C of the test statistic from alpha
- compute test statistic T from the data
- if T in C, then reject H_0; else, do not reject H_0

p-value (significance probability): probability of obtaining a value for the
test statistic that is at least as extreme as the observed value, assuming
H_0 is true
=> checking T in C is equivalent to checking p-value < alpha

NOTE: failure to reject H_0 does not prove that H_0 is true at any confidence
level! (type II errors)

note: the power of a test depends on sample size; the nature of this
relationship can be difficult to determine, but power calculations or
estimations exist for most standard tests.
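As a concrete illustration of the general testing procedure and the use of
p-values - with simulated data, and using R's built-in Welch t-test, a
parametric test chosen here purely for illustration and not otherwise covered
in these notes - a sketch might look as follows:

  ## illustrative sketch with simulated (not measured) run-times
  ## H_0: algorithms A and B have the same mean run-time for a given input
  set.seed(42)                        # seed given for reproducibility of the sketch
  rt.A <- rexp(25, rate = 1/10)       # hypothetical run-times of algorithm A
  rt.B <- rexp(25, rate = 1/12)       # hypothetical run-times of algorithm B

  alpha <- 0.05                       # significance level, set by the experimenter
  res <- t.test(rt.A, rt.B)           # Welch two-sample t-test (parametric)
  res$statistic                       # test statistic T
  res$p.value                         # p-value (significance probability)
  if (res$p.value < alpha) {          # equivalent to checking whether T lies in C
    cat("reject H_0 at significance level", alpha, "\n")
  } else {
    cat("do not reject H_0 (note: this does NOT prove H_0 is true)\n")
  }

Note that exponentially distributed run-times, as simulated here, violate the
normality assumption of the t-test; this is exactly the situation in which the
non-parametric and resampling-based tests discussed below are preferable.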
Example 1: Spearman's rank order test

purpose: test significance of correlation between paired observations
given: two samples (paired)
H_0: no significant (monotone) relationship between the samples
statistic: based on the Spearman rank order correlation coefficient
  = Pearson corr coeff on the ranks of the (ordered) data
note: does not require a normality assumption, is not restricted to linear
  correlation (unlike tests based on the Pearson corr coeff)
in R: cor.test(corr$V1, corr$V2, method="spear")

[Similar: Kendall's tau test; in R: cor.test(corr$V1, corr$V2, method="kendall")]

This is a non-parametric test, i.e., it does not make any assumption regarding
the type of the distribution underlying the samples. (In particular, no
assumption of normality is made.) Non-parametric tests need to be used when
distributional assumptions of parametric tests are or may be violated.
Here: Spearman's rank order test and Kendall's tau test are preferable over
tests based on the Pearson correlation coefficient, which require the
assumption that the samples are normally distributed.

Notes:
- When using parametric tests, need to use specific tests to check
  distributional assumptions (e.g., using the Shapiro-Wilk test for normality).
- When distributional assumptions are met, parametric tests are usually more
  powerful than non-parametric tests, i.e., they achieve the same power for
  smaller sample sizes.

Statistical tests based on resampling methods (data/compute-intensive methods):
- bootstrap
- jackknife
- permutation/randomisation tests
- cross validation

Key idea:
- use a sample drawn from a population of interest to represent that population
- resample from that data (with or without replacement, depending on the
  specific resampling procedure used) to make inferences about the population,
  in particular, to assess the stability of statistics (e.g., mean, median, ...)
  and to test hypotheses.

Example 2: Permutation test for equality of means

purpose: test significance of differences observed between the means of paired
  observations (e.g., of running times)
given: two samples (paired), (x_1,y_1), ..., (x_n,y_n)
H_0: the two distributions underlying the paired observations, X and Y, are the
  same, and therefore there is no significant difference between the means
H_a: mean(X) > mean(Y)  (=> one-sided test)

procedure:
1. compute the observed difference d := mean_i{x_i} - mean_i{y_i};
   this is the test statistic
2. pool the x_i and y_i into one list L
3. repeat k times:
     determine a permutation of L uniformly at random
     use the first half of L as the x'_i's and the second half as the y'_i's
       (=> permutation resample)
     compute d' := mean_i{x'_i} - mean_i{y'_i}
4. consider the distribution D of the d' values over all permutation resamples
   (=> permutation distribution)
5. the location of d in D determines the p-value
   one-sided test => empirical probability that values in D are >= d
   [question to students: relation to quantiles?]
6. compare the p-value against the chosen significance level to determine
   rejection of H_0

typical number k of resamples: ~100 - ~1000

note: this procedure effectively treats the data as unpaired - not a problem
here, because the comparison of means does not depend on the pairing.

Equivalent to the method used in step 3 above, but more general (because it
preserves the pairing): swap x_i and y_i for each i with probability 0.5,
independently at random.

[question to students: how to investigate two-sided H_a?]
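A minimal R sketch of the one-sided permutation test described above (the
function name perm.test.means and the simulated data are made up for this
illustration):

  ## minimal sketch of the permutation test for equality of means (one-sided)
  perm.test.means <- function(x, y, k = 1000) {  # assumes length(x) == length(y)
    d <- mean(x) - mean(y)                # step 1: observed difference (test statistic)
    L <- c(x, y)                          # step 2: pool the observations
    n <- length(x)
    d.perm <- replicate(k, {              # step 3: k permutation resamples
      Lp <- sample(L)                     # permutation of L, uniformly at random
      mean(Lp[1:n]) - mean(Lp[(n + 1):(2 * n)])
    })
    mean(d.perm >= d)                     # steps 4/5: empirical one-sided p-value
  }

  ## usage with simulated run-times; compare the result against the chosen alpha (step 6):
  set.seed(23)
  x <- rexp(50, rate = 1/12)              # hypothetical run-times of algorithm X
  y <- rexp(50, rate = 1/10)              # hypothetical run-times of algorithm Y
  perm.test.means(x, y, k = 1000)

The paired variant mentioned above is obtained by replacing the pooling and
permutation step with an independent swap of x_i and y_i with probability 0.5
for each i.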
[see also http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf]

---

learning goals (for module 1):

- be able to adequately specify execution environments, implementations and
  inputs
- be able to explain and apply guidelines for benchmark selection
- be able to explain and use methods for run-time measurement, including CPU
  timing and cost models based on operation counts
- be able to effectively use tables and graphs for reporting empirical results
- be able to explain the purpose of and differences between exploratory and
  confirmatory analysis
- be able to explain the differences between parametric and non-parametric
  hypothesis tests
- be able to explain the basic idea behind resampling methods for estimating
  the precision of sample statistics and for testing hypotheses: bootstrap,
  permutation test, cross validation
- be able to determine and correctly apply appropriate statistical tests for
  various standard purposes, including correlation between paired data,
  equality of means, normality and equality of distributions