CPSC 536H: Empirical Algorithmics (Spring 2008)
Notes by Holger H. Hoos, University of British Columbia

---------------------------------------------------------------------------------------------
Module 2: Deterministic algorithms for decision problems
---------------------------------------------------------------------------------------------

2.1 Decision problems

Given: Input data (e.g., a graph G and an integer k)
Objective: Output "yes" or "no" answer (e.g., to the question "is there a clique of size k in G?")

Other examples:
- primality: given an integer n, is n a prime number?
- SAT: given a propositional formula F, is there an assignment a of truth values to the variables in F such that F is true under a?
- scheduling: given a set of resources R, a set of tasks T with resource requirements, and a time t, can all tasks in T be accomplished within time t?

Note: Formally, decision problems can be represented by the set of all "yes" instances, i.e., the set of input data for which the answer is "yes" (this set may be hard or impossible to compute).

---

2.2 Deterministic decision algorithms

A decision algorithm is called deterministic if its behaviour is completely determined by the given input data.

Here:
- Consider only error-free decision algorithms, i.e., algorithms that never give an incorrect answer.
- Consider only algorithms that terminate on every given problem instance.
- Focus on measuring the performance of the algorithm (typically the run-time required for solving a given problem instance, but can also be the consumption of other resources, in particular memory).

Note:
- Input data can comprise the specification of the problem instance as well as parameter settings.
- The performance of a given deterministic algorithm A can depend significantly on its implementation and the environment in which it is executed.

Execution environment = machine M + operating system S + everything that happens on M other than the execution of A

Note:
- The execution environment can be difficult to control, which may cause noise in performance measurements (e.g., due to other processes in multi-tasking operating systems, cache effects, network activity).

How to specify an execution environment? [ask students]
- machine parameters: processor make + model + speed, size of cache and RAM; sometimes also required: disk size + speed, network type + speed, ...
- operating system make + version, ...

Note:
- Concurrent processes are hard or impossible to specify -> try to control them (by the type of measurement performed - discussed later -, by restricting access to the machine, and by avoiding heavy loads that may cause substantial amounts of page swapping, network traffic, ...).

How to specify an implementation? [ask students]
- report all non-obvious details that impact performance (these may depend substantially on the performance measure, see below)
- make source code (or at least an executable) publicly available (to ensure reproducibility of results)
- report programming language and compiler (+ relevant settings, in particular optimisations)
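A minimal sketch of how such information could be recorded from within R (assuming a Linux machine where /proc/cpuinfo and /proc/meminfo are available; the fields shown are illustrative, not a complete specification):

  env.info <- list(
    machine  = Sys.info()[["nodename"]],
    os       = paste(Sys.info()[["sysname"]], Sys.info()[["release"]]),
    cpu      = grep("model name", readLines("/proc/cpuinfo"), value = TRUE)[1],
    ram      = grep("MemTotal", readLines("/proc/meminfo"), value = TRUE)[1],
    language = R.version.string              # report interpreter/compiler version and settings
  )
  str(env.info)                              # store alongside the experimental results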
How to specify input?
- completely specify the problem instances used + all parameter settings

Note:
- Problem instances can be specified by reference (to the literature or a benchmark collection) or by content (the complete set of instance data).
- Care should be taken when using instance generators: always specify the generator + all of its input (including the random number seed for random instance generators).

Question [ask students]: When using a random instance generator, why is it important to completely specify its input (including the seed) rather than just saying something like 'we used a sample of 100 instances obtained from generator X with parameters P'?
[Answer: extreme variation, outliers]

- Pitfall: on-line resources can easily change or disappear!

General issue: Benchmark selection

Some criteria for constructing/selecting benchmark sets:
- instance hardness (focus on challenging instances)
- instance size (provide a range to facilitate scaling studies)
- instance type (provide variety):
  - individual application instances
  - hand-crafted instances (realistic, artificial)
  - ensembles of instances from random distributions (random instance generators)
  - encodings of various other types of problems (e.g., SAT-encodings of graph colouring problems)

To ensure comparability and reproducibility of results:
- use established benchmark sets from public benchmark libraries (e.g., SATLIB, TSPLIB, CSPLIB, ORLIB, etc.) and/or the related literature;
- make newly created test-sets available to other researchers (via submission to benchmark libraries, publication on the WWW).

Note: Careful selection and a good understanding of benchmark sets are often crucial for the relevance of an empirical study!

---

2.3 Empirical analysis of a single algorithm

For a given instance:
- run the algorithm and measure its performance

Issues:
- choose a performance measure
- control the execution environment (see above)

General issue: Measuring run-time

Note: For now, we will focus on run-time as the primary performance measure.

Wall clock time vs. CPU time [prefer the latter - why?]

To ensure reproducibility and comparability of empirical results, CPU times should be measured in a way that is as independent as possible from machine load.

Note:
- CPU timing has limited resolution - beware of floor effects!

Practical issue: How to report measurements of zero CPU time? [as '< res']

[Briefly show and discuss timer code for Linux]

To achieve better abstraction from the implementation and run-time environment, it is often preferable to measure run-time using
- operation counts that reflect the number of operations that contribute significantly towards an algorithm's performance, and
- cost models that specify the CPU time for each such operation for given input data, implementation and execution environment.

Note:
- Cost models typically depend significantly on the instance data and can depend on algorithm parameters.

Pitfalls:
- counting operations whose costs vary for fixed input data, implementation + execution environment and using these as surrogates for CPU time
- using cost models that do not reflect important (and costly) operations

Additional benefits of operation counts:
- can achieve better resolution than direct CPU timing

Question: How can we determine the CPU time per operation in that case? [ask students]
[Answer: measure the time for a sequence of operations - see the sketch below]
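A minimal sketch of this idea in R (the operation and repetition count below are hypothetical placeholders): time a long sequence of the operation of interest and divide the measured CPU time by the number of operations.

  n <- 10^7                                        # number of repetitions (hypothetical)
  t <- system.time(for (i in 1:n) x <- sqrt(i))    # 'sqrt' stands in for the counted operation
  (t[["user.self"]] + t[["sys.self"]]) / n         # estimated CPU time per operation
  # caveat: this includes loop overhead; in practice, time the algorithm's own inner loop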
For a set of problem instances:
- on each instance, run the algorithm and measure its performance
- analyse the results

Search cost distributions (SCDs):
- distribution of search cost over a set of problem instances
- for instances obtained from a random generator: SCD = empirical probability distribution

General issue: Analysing and summarising distributions

Graphical representations:
- CDFs vs PDFs - CDFs are typically preferable [Why? Ask students. Answer: no binning effects, quantiles can be read off directly, better readability of plots showing multiple distributions]
- use of log plots [examples in gnuplot]
- modes [What do modes mean? Ask students.]
- heavy (=long) vs. fat tails

Def: A random variable X has a heavy tail on the right iff P(X > x) \sim x^(-\alpha) with 0 < \alpha < 2
=> power-law decay of the right tail

fat tail = a tail that is fatter than in a Gaussian (= normal distribution) [as measured by kurtosis]

[What do fat / heavy tails mean? Ask students.]

Notes: heavy-tailed distributions
- have infinite variance and can have infinite mean (for \alpha <= 1) -> instability of sample means, ...
- have been used for modelling many phenomena, including network traffic and the behaviour of search algorithms for NP-hard problems
- are closely related to self-similar phenomena, self-similar structures
- heavy left tails??
[see also: http://en.wikipedia.org/wiki/The_long_tail, http://en.wikipedia.org/wiki/Long-range_dependency]

- outliers - def: x is an outlier if it lies more than 1.5 times the inter-quartile range (IQR) from the closest quartile, i.e., min{|x - q_0.25|, |x - q_0.75|} >= 1.5*(q_0.75 - q_0.25)

Descriptive statistics (summarise a distribution):

- location: mean, median, quantiles
  median for an even number of samples = average of the two middle values (or, alternatively, the upper of the two)

  Def: p-quantile q_p of a random variable X = value x s.t. P(X <= x) >= p *and* P(X >= x) >= 1-p
  => q_0 = min, q_1 = max
  estimates for sample quantiles: various algorithms (rounding, interpolation) - not much of an issue for large samples
  frequently used quantiles: q_0.5 = median; q_0.25, q_0.75 = quartiles
  quantiles are often preferable over means [Why? Ask students. Answer: statistical stability]

- spread: variance / stddev, quantile ranges, quantile ratios
  quantile ranges or ratios are often preferable over variance / stddev

- higher moments:
  - (sample) skewness (3rd moment) = measure of asymmetry
    = sqrt(n) * sum_{i=1..n} (x_i - mean(x))^3 / [sum_{i=1..n} (x_i - mean(x))^2]^(3/2)
  - (sample) kurtosis (4th moment) = measure of 'peakedness' (also reflects 'fatness' of the tails)

Box plot (Tukey, 1977):
- box = q_0.25, q_0.75 (quartiles), line = median, whiskers = most extreme data points within 1.5*IQR of the quartiles, points = outliers (all of them)
[Draw illustration]
[see also http://web2.concordia.ca/Quality/tools/4boxplots.pdf]

Note:
- Sometimes, extreme outliers are distinguished from mild ones, where extreme outliers are more than 3*IQR from the closest quartile.
- Variations of the concept exist, e.g., in the definition of the whiskers, additional indication of the mean, ...

Example: Run-time data from a study of a heuristic MAX-CLIQUE algorithm from Franco Mascia (Univ Trento) [Show plot]

Note: SCDs on hard combinatorial problems are often not normally distributed and often have very high variance, long tail(s)
-> be careful when using statistical tests!!

Characterisation by means of known, parametric distributions (function fitting) [discuss only briefly here]
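To illustrate the above in R (with hypothetical run-time data; in practice the measured CPU times would be read from a file):

  rt <- c(0.12, 0.35, 0.41, 0.77, 1.2, 2.3, 4.1, 9.8, 35, 120)   # hypothetical run-times [CPU sec]
  plot(sort(rt), (1:length(rt)) / length(rt), type = "s", log = "x",
       xlab = "run-time [CPU sec]", ylab = "P(RT <= x)")          # empirical CDF, log-scaled x-axis
  quantile(rt, probs = c(0.25, 0.5, 0.75))                        # quartiles and median
  IQR(rt)                                                         # inter-quartile range
  boxplot(rt, log = "y")                                          # box plot on a log-scaled axis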
When to summarise results on benchmark sets? [Ask students.]
- when instances come from a distribution (e.g., a random instance generator or another stochastic process)
- when dealing with a large number of instances
- caution: when looking at summary statistics only, it is sometimes easy to miss important effects - in particular, when summarising over heterogeneous test sets

---

2.4 Correlation between instance properties and performance

Goal: analyse / characterise the impact of instance properties on the performance of an algorithm

Simple qualitative analysis:
Plot the correlation between a given instance property and performance, one data point per instance (scatter plot).

Simple quantitative analysis:
standard (Pearson) correlation coefficient
- measures linear correlation only
- |r| = 1 <=> perfect linear correlation
- r = 0 <=> no linear correlation
- can use nonlinear transformations (particularly log, loglog) to test for non-linear dependencies

Question: When is an observed correlation statistically significant?
=> use a statistical hypothesis test to assess significance

--

Recap / brief intro: statistical hypothesis tests

null hypothesis (H_0): hypothesis of no change or experimental effect
e.g., two randomised algorithms have the same mean run-time for given input
(default outcome of the test; this is often the opposite of what we really want to show)

alternative hypothesis (H_a): hypothesis of change or experimental effect
e.g., the two algorithms do not have the same mean run-time for the given input

types of error:
type I: erroneous rejection of a true H_0 (prob alpha)
type II: erroneous failure to reject a false H_0 (prob beta)

alpha = prob of type I error = significance level of the test
controlled by the experimenter, depending on the acceptable risk of a type I error
common values: 0.05 and 0.01
confidence level = 1 - alpha

beta = prob of type II error
power = 1 - beta

general testing procedure:
- set alpha
- compute rejection region C of the test statistic from alpha
- compute test statistic T from the data
- if T in C, then reject H_0; else, do not reject H_0

p-value (significance probability): probability of obtaining a value of the test statistic that is at least as extreme as the observed value, assuming H_0 is true
=> checking T in C is equivalent to checking p-value < alpha

NOTE: failure to reject H_0 does not prove that H_0 is true at any confidence level! (type II errors)

--

General issue: testing the significance of correlations

Spearman's rank order test
given: two samples
H_0: there is no monotone association between the two samples (Spearman rank correlation = 0)
statistic: based on the Spearman rank order correlation coefficient = Pearson corr coeff computed on the ranks of the data
note: does not require a normality assumption, is not restricted to linear correlation (unlike tests based on the Pearson corr coeff)
in R: cor.test(corr$V1,corr$V2, method="spear")

[Similar: Kendall's tau test; in R: cor.test(corr$V1,corr$V2, method="kendall")]

Note: These non-parametric tests are preferable to tests based on the Pearson correlation coefficient, which require the assumption that the SCDs are normal distributions.

--

Analyse scaling of performance with instance size:
- measure performance for various instance sizes
- exploratory analysis:
  - use a scaling plot for an initial visual analysis (note: log / loglog plots can be very useful in this context)
  - fit one or more parametric models to the data points (e.g., a * e^(b*x), a * x^b, ...) using a continuous optimisation technique (e.g., the 'fit' function in gnuplot - caveat: local minima, divergence!); see the sketch below

  Practical tip: check all fits visually; fit multiple times with different initial values for the parameters, encouraging the optimiser to approach the best values from below and above.

  - RMSE (= root mean squared error = RMS of the residuals) can be used for an initial assessment of the relative fit of various models.
  [see m2-scaling.pdf]
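As an illustration of such a fit (shown here using R's nls rather than gnuplot's fit, with hypothetical scaling data):

  size <- c(50, 100, 150, 200, 250, 300)              # instance sizes (hypothetical)
  rt   <- c(0.02, 0.09, 0.35, 1.4, 5.6, 22.0)         # median run-times [CPU sec] (hypothetical)
  plot(size, rt, log = "y")                            # exponential scaling -> straight line on a semi-log plot
  cf <- coef(lm(log(rt) ~ size))                       # starting values from a linear fit in log space
  fit.exp <- nls(rt ~ a * exp(b * size),               # exponential model a * e^(b*x)
                 start = list(a = exp(cf[[1]]), b = cf[[2]]))
  sqrt(mean(residuals(fit.exp)^2))                     # RMSE of the residuals (cf. note above)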
- confirmatory analysis:
  - challenge the model by interpolation or extrapolation (i.e., compare predictions obtained from the model against actual data - the latter data must not have been used previously for fitting the model)
  - use bootstrapping for checking the statistical significance of differences/agreement between the predictions and observations from the previous step

bootstrapping for scaling analysis:
given performance measures for m problem instances per size,
for k times: draw performance measures for l
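A generic sketch of such a bootstrap check in R (illustrative only - the data, prediction and number of resamples below are hypothetical, and this is not necessarily the exact protocol intended here): compare a model prediction of the median run-time at a previously unused instance size against a bootstrap percentile confidence interval computed from the run-times observed at that size.

  rt.obs <- c(3.1, 4.7, 5.2, 5.9, 7.4, 8.8, 12.3, 15.0, 21.7, 40.2)  # observed run-times at the test size
  pred   <- 9.5                                      # hypothetical model prediction for the median run-time
  k <- 1000                                          # number of bootstrap resamples
  boot.med <- replicate(k, median(sample(rt.obs, replace = TRUE)))
  ci <- quantile(boot.med, c(0.025, 0.975))          # 95% percentile interval for the median
  pred >= ci[[1]] && pred <= ci[[2]]                 # TRUE: prediction consistent with the observations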