CPSC 536H: Empirical Algorithmics (Spring 2012)
Notes by Holger H. Hoos, University of British Columbia

---------------------------------------------------------------------------------------------
Module 2: Deterministic decision procedures
---------------------------------------------------------------------------------------------

2.1 Deterministic decision algorithms

Given: Input data (e.g., a graph G and an integer k)
Objective: Output "yes" or "no" answer (e.g., to the question "is there a clique of size k in G?")

Other examples:
- primality: given an integer n, is n a prime number?
- SAT: given a propositional formula F, is there an assignment a of truth values to the variables in F such that F is true under a?
- scheduling: given a set of resources R, a set of tasks T with resource requirements, and a time t, can all tasks in T be accomplished within time t?

Note: Formally, decision problems can be represented by the set of all "yes" instances, i.e., the set of input data for which the answer is "yes" (this set may be hard or impossible to compute).

A decision algorithm is an algorithm for solving a given decision problem.
A decision algorithm is called deterministic if its behaviour is completely determined by the given input data.

Here:
- consider only error-free decision algorithms, i.e., algorithms that never give an incorrect answer
- consider only algorithms that terminate on every given problem instance
- focus on measuring the performance of an algorithm (typically the run-time required for solving a given problem instance, but possibly also the consumption of other resources, in particular memory)

Note: The concepts and techniques discussed here apply to all algorithms for which we are only interested in analysing running time (or a similar single resource consumed) when applied to given inputs.

---

2.2 Empirical analysis of a single decision algorithm

For a given instance:
- run algorithm and measure performance

Issues:
- choose performance measure
- control execution environment (see above)

For a set of problem instances:
- on each instance, run algorithm and measure performance
- analyse results

Solution cost distributions (SCDs):
- distribution of solution cost (running time) over a set of problem instances
- for instances obtained from a random generator: SCD = empirical probability distribution

General issue: analysing and summarising distributions

Graphical representations:
- CDFs (cumulative distribution functions) vs PDFs (probability density functions)
- CDFs are typically preferable [Why? Ask students. Answer: ...]
- use of log plots [examples in gnuplot]
- modes [What do modes mean? Ask students.]
- heavy (= long) vs fat tails

Def: random variable X has a heavy tail on the right iff P(X > x) \sim x^(-\alpha) with 0 < \alpha < 2
=> power-law decay of the right tail

fat tail = tail fatter than that of a Gaussian (= normal distribution) [as measured by kurtosis]

[What do fat / heavy tails mean? Ask students.]

Notes: heavy-tailed distributions
- have infinite variance and can have infinite mean (for \alpha <= 1)
  -> instability of sample means, ...
- have been used for modelling many phenomena, including network traffic and the behaviour of search algorithms for NP-hard problems
- are closely related to self-similar phenomena, self-similar structures
- heavy left tails??

[see also: http://en.wikipedia.org/wiki/The_long_tail, http://en.wikipedia.org/wiki/Long-range_dependency]
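[The following minimal Python sketch is illustrative only (synthetic data; assumes numpy and matplotlib): it plots the empirical CDFs of a light-tailed (exponential) and a heavy-tailed (Pareto, alpha = 1.5) sample on a log-scaled x-axis and illustrates the instability of sample means under heavy tails.]

    # Sketch: empirical CDFs of two synthetic "run-time" samples on a
    # log-scaled x-axis. The exponential sample has a light right tail; the
    # Pareto sample (alpha = 1.5, i.e., 0 < alpha < 2) is heavy-tailed:
    # its variance is infinite and sample means are unstable.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n = 10000
    light = rng.exponential(scale=1.0, size=n)   # light right tail
    heavy = 1 + rng.pareto(1.5, size=n)          # classical Pareto, alpha = 1.5

    def ecdf(sample):
        # sorted values and empirical P(X <= x)
        x = np.sort(sample)
        return x, np.arange(1, len(x) + 1) / len(x)

    for sample, label in [(light, "exponential"), (heavy, "Pareto, alpha = 1.5")]:
        x, y = ecdf(sample)
        plt.step(x, y, where="post", label=label)

    plt.xscale("log")   # log plot spreads out the right tail
    plt.xlabel("solution cost (run-time)")
    plt.ylabel("P(X <= x)")
    plt.legend()
    plt.show()

    # instability of sample means under heavy tails: means of independent
    # subsamples fluctuate far more for the Pareto data
    for label, draw in [("exponential", lambda: rng.exponential(1.0, 1000)),
                        ("Pareto     ", lambda: 1 + rng.pareto(1.5, 1000))]:
        print(label, "subsample means:", [round(float(np.mean(draw())), 2) for _ in range(5)])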
- outliers
  def: x is an outlier if it is more than 1.5 times the inter-quartile range (IQR) from the closest quartile, i.e., min{|x - q_0.75|, |x - q_0.25|} > 1.5 * (q_0.75 - q_0.25)

Descriptive statistics (summarise distribution):

- location: mean, median, quantiles

  median for an even number of samples = average of the two middle values (or, by some conventions, the larger of the two)

  Def: p-quantile q_p of a random variable X = value x s.t. P(X <= x) >= p and P(X >= x) >= 1-p
  => q_0 = min, q_1 = max

  estimates for sample quantiles: various algorithms (rounding, interpolation) - not much of an issue for large samples

  frequently used quantiles: q_0.5 = median; q_0.25, q_0.75 = quartiles

  quantiles are often preferable over means [Why? Ask students. Answer: statistical stability]

- spread: variance / stddev, quantile ranges, quantile ratios

  quantile ranges or ratios are often preferable over variance / stddev

- higher moments:
  - (sample) skewness (3rd moment) = measure of asymmetry:
    sqrt(n) * sum_{i=1..n} (x_i - mean(x))^3 / [sum_{i=1..n} (x_i - mean(x))^2]^(3/2)
  - (sample) kurtosis (4th moment) = measure of 'peakedness' (also reflects 'fatness' of tails); as excess kurtosis:
    n * sum_{i=1..n} (x_i - mean(x))^4 / [sum_{i=1..n} (x_i - mean(x))^2]^2 - 3

Box plot (Tukey, 1977):
- box = q_0.25, q_0.75 (quartiles), line = median, whiskers = q_0.25 - 1.5*IQR and q_0.75 + 1.5*IQR, points = outliers (all of them)
[Draw illustration]
[see also http://web2.concordia.ca/Quality/tools/4boxplots.pdf]

Note:
- sometimes, extreme outliers are distinguished from mild ones, where extreme outliers are more than 3*IQR from the closest quartile
- variations of the concept exist, e.g., in the definition of the whiskers, additional indication of the mean, ...
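[A small Python sketch of the statistics defined above, on made-up run-times (illustrative only; assumes numpy): quartiles, IQR, outliers according to the 1.5*IQR rule, and sample skewness / excess kurtosis from the moment formulas.]

    # Sketch: descriptive statistics for a small sample of run-times,
    # following the definitions above (quartiles, IQR, the 1.5*IQR outlier
    # rule, and sample skewness / excess kurtosis from the moment formulas).
    import numpy as np

    x = np.array([0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.4, 2.0, 9.5])  # made-up run-times

    q25, med, q75 = np.quantile(x, [0.25, 0.5, 0.75])  # interpolated sample quantiles
    iqr = q75 - q25

    # outliers: more than 1.5 * IQR from the closest quartile
    outliers = x[(x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)]

    d = x - x.mean()
    skewness = np.sqrt(len(x)) * np.sum(d**3) / np.sum(d**2) ** 1.5
    excess_kurtosis = len(x) * np.sum(d**4) / np.sum(d**2) ** 2 - 3

    print(f"median = {med:.3f}, quartiles = ({q25:.3f}, {q75:.3f}), IQR = {iqr:.3f}")
    print(f"outliers = {outliers}, skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")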
---

General issue: fundamental differences between normally distributed data and other types of distributions

Fraction of values within k*stddev of the mean:

normal distribution (68-95-99.7 rule):

  k    within    outside
  1    0.6827    0.3173
  2    0.9545    0.0455
  3    0.9973    0.0027
  4    0.9999    0.0001

(note: the fraction within k*stddev is given by the error function, erf(k/sqrt(2)))

  within    k
  0.8       1.2816
  0.9       1.6449
  0.95      1.9600
  0.99      2.5758
  0.999     3.2905
  0.9999    3.8906
  0.99999   4.4172

exponential distribution with rate lambda (mean = stddev = 1/lambda; CDF: F(x) = 1 - exp(-lambda*x)); for lambda = 1:

  k    within    outside
  1    0.8647    0.1353
  2    0.9502    0.0498
  3    0.9817    0.0183
  4    0.9933    0.0067

note: much higher likelihood of observing extreme values!

See also:
- http://en.wikipedia.org/wiki/Exponential_distribution
- http://en.wikipedia.org/wiki/Normal_distribution

--

Example: Run-time data from a study of a heuristic MAX-CLIQUE algorithm by Franco Mascia (Univ Trento)
[Show plot]

Note: SCDs on hard combinatorial problems are often not normally distributed; they often have very high variance and long tail(s)
-> be careful when using statistical tests!!

Characterisation by means of known, parametric distributions (function fitting) [discuss only briefly here]

When to summarise results on benchmark sets? [Ask students.]
- when instances come from a distribution (e.g., random number generator, other stochastic process)
- when dealing with a large number of instances
- caution: when looking at summary statistics only, it is sometimes easy to miss important effects - in particular, when summarising over heterogeneous test sets

---

2.3 Correlation between instance properties and performance

Goal: analyse / characterise the impact of instance properties on the performance of an algorithm

Simple qualitative analysis:
plot the correlation between a given instance property and performance, one data point per instance (scatter plot)

Simple quantitative analysis:
standard (Pearson) correlation coefficient r
- measures linear correlation only
- |r| = 1 <=> perfect linear correlation
- r = 0 <=> no linear correlation
- can use nonlinear transformations (particularly log, log-log) to test for non-linear dependencies

Question: When is an observed correlation statistically significant?
=> use a statistical hypothesis test to assess significance

--

Analyse scaling of performance with instance size:
- measure performance for various instance sizes
- exploratory analysis:
  - use a scaling plot for initial visual analysis (note: log / log-log plots can be very useful in this context)
  - fit one or more parametric models to the data points (e.g., a * e^(b*x), a * x^b, ...) using a continuous optimisation technique (e.g., the 'fit' function in gnuplot - caveat: local minima, divergence!); see the first sketch at the end of this section
    Practical tip: check all fits visually; fit multiple times with different initial values for the parameters, encouraging the optimiser to approach the best values from below and above.
  - RMSE (= root mean squared error = RMS of the residuals) can be used for an initial assessment of the relative fit of various models
- confirmatory analysis:
  - challenge the model by interpolation or extrapolation (i.e., compare predictions obtained from the model against actual data - the latter data must not have been used previously for fitting the model)
  - use bootstrapping to check the statistical significance of differences / agreement between the predictions and observations from the previous step (see the second sketch below)

bootstrapping for scaling analysis:
given performance measures for m problem instances per size,
for k times: draw performance measures for l instances per size, uniformly at random with replacement, and compute the statistic of interest (e.g., the median) on each such bootstrap sample
=> the resulting empirical distribution of k statistic values yields confidence intervals against which the model's predictions can be compared
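[First sketch - the exploratory fitting step in Python (illustrative only: data and starting values are made up; assumes numpy and scipy, whose curve_fit here plays the role of gnuplot's 'fit').]

    # Sketch: fit two candidate scaling models to (size, run-time) data by
    # least squares and compare them via RMSE. The same caveats as for
    # gnuplot's 'fit' apply: local minima, divergence -> try several
    # starting values p0 and inspect all fits visually.
    import numpy as np
    from scipy.optimize import curve_fit

    sizes = np.array([100., 150., 200., 250., 300., 350., 400.])  # made-up data
    times = np.array([0.10, 0.33, 0.98, 3.30, 9.80, 32.0, 99.0])

    def exp_model(x, a, b):   # a * e^(b*x)
        return a * np.exp(b * x)

    def poly_model(x, a, b):  # a * x^b
        return a * x ** b

    def rmse(model, params):  # root mean squared error of the residuals
        return float(np.sqrt(np.mean((model(sizes, *params) - times) ** 2)))

    for model, p0, name in [(exp_model, (0.01, 0.02), "a * e^(b*x)"),
                            (poly_model, (1e-11, 5.0), "a * x^b")]:
        params, _ = curve_fit(model, sizes, times, p0=p0, maxfev=20000)
        print(f"{name}: a={params[0]:.3g}, b={params[1]:.3g}, RMSE={rmse(model, params):.3f}")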
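[Second sketch - the bootstrapping step in Python (again illustrative, with made-up run-times; the median as statistic and 95% confidence intervals are choices for the example, not prescribed above).]

    # Sketch: percentile bootstrap for scaling analysis. For each instance
    # size, resample the m observed run-times with replacement k times,
    # compute the median of each bootstrap sample, and report a 95%
    # confidence interval. A model prediction falling outside this interval
    # at an unseen size would challenge the fitted scaling model.
    import numpy as np

    rng = np.random.default_rng(1)
    k = 1000  # number of bootstrap samples

    # made-up run-times for m = 8 instances at each of three sizes
    data = {200: [0.9, 1.0, 1.1, 1.0, 1.3, 0.8, 1.2, 5.0],
            300: [9.0, 10.5, 9.8, 11.0, 9.5, 30.0, 10.2, 9.9],
            400: [95.0, 101.0, 99.0, 160.0, 97.0, 102.0, 98.0, 100.5]}

    for size, times in data.items():
        times = np.asarray(times)
        stats = [np.median(rng.choice(times, size=len(times), replace=True))
                 for _ in range(k)]
        lo, hi = np.quantile(stats, [0.025, 0.975])
        print(f"size {size}: median = {np.median(times):.1f}, "
              f"95% bootstrap CI = ({lo:.1f}, {hi:.1f})")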