CPSC 536H: Empirical Algorithmics (Spring 2012)
Notes by Holger H. Hoos, University of British Columbia

-------------------------------------------------------
Module 1: General considerations for empirical analysis
-------------------------------------------------------

Purpose: Discuss some issues that come up in all or most empirical analyses of
algorithms (in fact, since design relies on analysis, design tasks also depend
on the same issues)


1.1 How to specify an execution environment?

Execution environment = machine M + operating system S + everything that
happens on M other than the execution of A

Note:
- The execution environment can be difficult to control, which may cause noise
  in performance measurements (e.g., due to other processes in multi-tasking
  operating systems, cache effects, network activity)

How to specify an execution environment? [ask students]
- machine parameters: processor make + model + speed, size of cache and RAM;
  sometimes also required: disk size + speed, network type + speed, ...
- operating system make + version, ...

Note:
- concurrent processes are hard or impossible to specify
  -> try to control (by type of measurement performed - discussed later,
     restricting access to the machine, avoiding heavy loads that may cause
     substantial amounts of page swapping, network traffic, ...)


1.2 How to specify an implementation? [ask students]

- report all non-obvious details that impact performance
  (these may depend substantially on the performance measure, see below)
- make source code (or at least the executable) publicly available
  (to ensure reproducibility of results)
- report programming language and compiler (+ relevant settings, in particular,
  optimisations)


1.3 How to specify input?

- completely specify problem instances used + all parameter settings

Note:
- problem instances can be specified by reference (to literature or a benchmark
  collection) or by content (complete set of instance data)
- care should be taken when using instance generators: always specify the
  generator + all input (including the random number seed for random instance
  generators)

Question [ask students]: When using a random instance generator, why is it
important to completely specify its input (including the seed) rather than just
saying something like 'we used a sample of 100 instances obtained from
generator X with parameters P'?
[Answer: extreme variation, outliers]

- Pitfall: on-line resources can easily change or disappear!


1.4 Benchmark selection

Some criteria for constructing/selecting benchmark sets:
- instance hardness (focus on challenging instances)
- instance size (provide range to facilitate scaling studies)
- instance type (provide variety):
  - individual application instances
  - hand-crafted instances (realistic, artificial)
  - ensembles of instances from random distributions (random instance generators)
  - encodings of various other types of problems
    (e.g., SAT-encodings of graph colouring problems)

To ensure comparability and reproducibility of results:
- use established benchmark sets from public benchmark libraries
  (e.g., SATLIB, TSPLIB, CSPLIB, ORLIB, etc.) and/or related literature;
- make newly created test-sets available to other researchers
  (via submission to benchmark libraries, publication on WWW)

Note: Careful selection and good understanding of benchmark sets are often
crucial for the relevance of an empirical study!


1.5 How to measure performance?

What are typical performance measures we might be interested in? [ask students]

Note: For now, we will focus on run-time as the primary performance measure.

Wall clock time vs. CPU time [prefer the latter - why?]

To ensure reproducibility and comparability of empirical results, CPU times
should be measured in a way that is as independent as possible from machine
load.

Note:
- CPU timing has limited resolution - beware of floor effects!
  Practical issue: How to report measurements of zero CPU time? [as '< res']

[Briefly show and discuss timer code for Linux]
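The Linux timer code shown in class is not reproduced in these notes. As a
minimal sketch of the distinction just discussed - written in R to match the
other code snippets in these notes, with a made-up helper name and workload -
CPU time and wall-clock time can be obtained from proc.time():

  ## minimal sketch (not the Linux timer code from class): CPU vs. wall-clock time
  measure.runtime <- function(f, ...) {    # hypothetical helper, for illustration only
    t0 <- proc.time()
    f(...)                                 # run the computation to be timed
    dt <- proc.time() - t0
    c(cpu  = unname(dt["user.self"] + dt["sys.self"]),  # CPU time used by this process
      wall = unname(dt["elapsed"]))                     # wall-clock time
  }

  ## usage with a hypothetical workload:
  measure.runtime(function(n) sum(sort(runif(n))), 1e6)

On a busy machine, the wall-clock time will typically exceed the CPU time;
reporting CPU time reduces (but does not eliminate) the influence of machine
load. Measurements close to the timer resolution should be reported as '< res',
as noted above.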
To achieve better abstraction from the implementation and run-time environment,
it is often preferable to measure run-time using
- operation counts that reflect the number of operations that contribute
  significantly towards an algorithm's performance, and
- cost models that specify the CPU time for each such operation for given
  input data, implementation and execution environment.

Note:
- cost models typically depend significantly on instance data and can depend
  on algorithm parameters

Pitfalls:
- counting operations whose costs vary for fixed input data, implementation
  + execution environment, and using these as surrogates for CPU time
- using cost models that do not reflect important (and costly) operations

Additional benefits of operation counts:
- can achieve better resolution than direct CPU timing
  Question: How can we determine CPU time / operation in that case? [ask students]
  [Answer: measure the time for a sequence of operations]


1.6 How to report results?

[see Peter Sanders: Presenting Data from Experiments in Algorithmics
 (and related work mentioned there)]


1.7 Exploratory vs confirmatory analysis

Most empirical studies have two components:

- exploratory analyses: collect observations, look at data, try to see trends,
  regularities, patterns
  results: new hypotheses, model modifications, ideas for experiments
  often rather informal, guided by intuition and (some) experience

- confirmatory analyses: carefully design and conduct experiments to answer
  very specific, technical questions
  results: quantitative answers to questions
  typically very technical, using established statistical methods
  (e.g., hypothesis tests)

Note:
- In many cases, researchers repeatedly switch between these stages.
- Tools and techniques exist for both types of analyses, but especially
  exploratory analyses typically require creativity, experience and good
  judgement.


1.8 Statistical hypothesis tests

null hypothesis (H_0): hypothesis of no change or experimental effect
  e.g., two randomised algorithms have the same mean run-time for given input
  (default outcome of the test; this is often the opposite of what we really
  want to show)

alternative hypothesis (H_a): hypothesis of change or experimental effect
  e.g., ... do not have the same mean run-time ...

types of error:
  type I:  erroneous rejection of a true H_0 (prob alpha)
  type II: erroneous failure to reject a false H_0 (prob beta)

alpha = prob of type I error = significance level of the test
  controlled by the experimenter, depending on the acceptable risk of a
  type I error; common values: 0.05 and 0.01
confidence level = 1 - alpha

beta = prob of type II error
power = 1 - beta

general testing procedure:
- set alpha
- compute rejection region C of the test statistic from alpha
- compute test statistic T from the data
- if T in C, then reject H_0; else, do not reject H_0

p-value (significance probability): probability of obtaining a value for the
test statistic that is at least as extreme as the observed value, assuming
H_0 is true
=> checking T in C is equivalent to checking p-value < alpha

NOTE: failure to reject H_0 does not prove that H_0 is true at any confidence
level! (type II errors)

note: the power of a test depends on sample size; the nature of this
relationship can be difficult to determine, but power calculations or
estimations exist for most standard tests.
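As a concrete illustration of the general testing procedure and the use of
p-values - with simulated data, and using R's built-in Welch t-test, a
parametric test chosen here purely for illustration and not otherwise covered
in these notes - a sketch might look as follows:

  ## illustrative sketch with simulated (not measured) run-times
  ## H_0: algorithms A and B have the same mean run-time for a given input
  set.seed(42)                        # seed given for reproducibility of the sketch
  rt.A <- rexp(25, rate = 1/10)       # hypothetical run-times of algorithm A
  rt.B <- rexp(25, rate = 1/12)       # hypothetical run-times of algorithm B

  alpha <- 0.05                       # significance level, set by the experimenter
  res <- t.test(rt.A, rt.B)           # Welch two-sample t-test (parametric)
  res$statistic                       # test statistic T
  res$p.value                         # p-value (significance probability)
  if (res$p.value < alpha) {          # equivalent to checking whether T lies in C
    cat("reject H_0 at significance level", alpha, "\n")
  } else {
    cat("do not reject H_0 (note: this does NOT prove H_0 is true)\n")
  }

Note that exponentially distributed run-times, as simulated here, violate the
normality assumption of the t-test; this is exactly the situation in which the
non-parametric and resampling-based tests discussed below are preferable.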
Example 1: Spearman's rank order test

purpose: test significance of correlation between paired observations
given: two samples (paired)
H_0: no significant (monotone) relationship between the samples
statistic: based on the Spearman rank order correlation coefficient
  = Pearson corr coeff on the ranks of the (ordered) data
note: does not require a normality assumption, is not restricted to linear
  correlation (unlike tests based on the Pearson corr coeff)
in R: cor.test(corr$V1, corr$V2, method="spear")

[Similar: Kendall's tau test; in R: cor.test(corr$V1, corr$V2, method="kendall")]

This is a non-parametric test, i.e., it does not make any assumption regarding
the type of the distribution underlying the samples. (In particular, no
assumption of normality is made.) Non-parametric tests need to be used when
distributional assumptions of parametric tests are or may be violated.
Here: Spearman's rank order test and Kendall's tau test are preferable over
tests based on the Pearson correlation coefficient, which require the
assumption that the samples are normally distributed.

Notes:
- When using parametric tests, need to use specific tests to check
  distributional assumptions (e.g., using the Shapiro-Wilk test for normality).
- When distributional assumptions are met, parametric tests are usually more
  powerful than non-parametric tests, i.e., they achieve the same power for
  smaller sample sizes.

Statistical tests based on resampling methods (data/compute-intensive methods):
- bootstrap
- jackknife
- permutation/randomisation tests
- cross validation

Key idea:
- use a sample drawn from a population of interest to represent that population
- resample from that data (with or without replacement, depending on the
  specific resampling procedure used) to make inferences about the population,
  in particular, to assess the stability of statistics (e.g., mean, median, ...)
  and to test hypotheses.

Example 2: Permutation test for equality of means

purpose: test significance of differences observed between the means of paired
  observations (e.g., of running times)
given: two samples (paired), (x_1,y_1), ..., (x_n,y_n)
H_0: the two distributions underlying the paired observations, X and Y, are the
  same, and therefore there is no significant difference between the means
H_a: mean(X) > mean(Y)  (=> one-sided test)

procedure:
1. compute the observed difference d := mean_i{x_i} - mean_i{y_i};
   this is the test statistic
2. pool the x_i and y_i into one list L
3. repeat k times:
     determine a permutation of L uniformly at random
     use the first half of L as the x'_i's and the second half as the y'_i's
       (=> permutation resample)
     compute d' := mean_i{x'_i} - mean_i{y'_i}
4. consider the distribution D of the d' values over all permutation resamples
   (=> permutation distribution)
5. the location of d in D determines the p-value
   one-sided test => empirical probability that values in D are >= d
   [question to students: relation to quantiles?]
6. compare the p-value against the chosen significance level to determine
   rejection of H_0

typical number k of resamples: ~100 - ~1000

note: this procedure effectively treats the data as unpaired - not a problem
here, because the comparison of means does not depend on the pairing.

Equivalent to the method used in step 3 above, but more general (because it
preserves the pairing): swap x_i and y_i for each i with probability 0.5,
independently at random.

[question to students: how to investigate two-sided H_a?]
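A minimal R sketch of the one-sided permutation test described above (the
function name perm.test.means and the simulated data are made up for this
illustration):

  ## minimal sketch of the permutation test for equality of means (one-sided)
  perm.test.means <- function(x, y, k = 1000) {  # assumes length(x) == length(y)
    d <- mean(x) - mean(y)                # step 1: observed difference (test statistic)
    L <- c(x, y)                          # step 2: pool the observations
    n <- length(x)
    d.perm <- replicate(k, {              # step 3: k permutation resamples
      Lp <- sample(L)                     # permutation of L, uniformly at random
      mean(Lp[1:n]) - mean(Lp[(n + 1):(2 * n)])
    })
    mean(d.perm >= d)                     # steps 4/5: empirical one-sided p-value
  }

  ## usage with simulated run-times; compare the result against the chosen alpha (step 6):
  set.seed(23)
  x <- rexp(50, rate = 1/12)              # hypothetical run-times of algorithm X
  y <- rexp(50, rate = 1/10)              # hypothetical run-times of algorithm Y
  perm.test.means(x, y, k = 1000)

The paired variant mentioned above is obtained by replacing the pooling and
permutation step with an independent swap of x_i and y_i with probability 0.5
for each i.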
[see also http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf]

---

learning goals (for module 1):

- be able to adequately specify execution environments, implementations and
  inputs
- be able to explain and apply guidelines for benchmark selection
- be able to explain and use methods for run-time measurement, including CPU
  timing and cost models based on operation counts
- be able to effectively use tables and graphs for reporting empirical results
- be able to explain the purpose of and differences between exploratory and
  confirmatory analysis
- be able to explain the differences between parametric and non-parametric
  hypothesis tests
- be able to explain the basic idea behind resampling methods for estimating
  the precision of sample statistics and for testing hypotheses: bootstrap,
  permutation test, cross validation
- be able to determine and correctly apply appropriate statistical tests for
  various standard purposes, including correlation between paired data,
  equality of means, normality and equality of distributions