Benchmarking for AC on the Cloud

Goals

We wish to perform algorithm configuration in a cloud setting. To do this effectively, we need to be able to deal with two issues inherent to cloud computing: the heterogeneity and variability of virtual machines. We want to be able to:
  • determine when a run is unreliable due to VM fluctuations
  • determine when an instance is no longer reliable as a whole so we can release it
  • determine how instances differ so as to properly weight and compare runs from different instances
  • perform all measurements and diagnostics with as little overhead as possible
Overall, we hope to:
  • devise a method to run algorithm configuration in the cloud to obtain results that are consistent with what we would obtain from running on a local cluster

Ideas

  • Look at the instance's CPU info and make an adjustment based on pre-existing/collected data for that CPU type (see the /proc/cpuinfo sketch after this list)
    • seems likely to provide some benefit; measurements from past research indicate that CPU type corresponds closely to performance
    • may or may not account for all inter-instance heterogeneity, but will not account for temporal variation
  • Run short algorithm runs multiple times to build confidence in the result (see the CoV-check sketch after this list)
    • one paper found that temporal variation was not significant over long periods of time, but in the short term it could be more of a factor
    • the shorter the run, the more redundancy we use
  • Run a large system benchmark (UnixBench?) at the launch of the instance to assess its performance
    • may or may not provide an accurate assessment of how the instance will perform on the target algorithm
    • if instances are consistent throughout their lifetime, one accurate measure at launch could be sufficient
  • Run multiple small benchmarks throughout the lifetime of an instance (the same CoV-check sketch applies here)
    • if instances are inconsistent throughout their lifetimes, this method could adjust as it goes
    • could detect if variations are too high for our purposes and terminate an instance
  • Constantly run some form of monitoring program in the background alongside the target algorithm runs
    • could provide very fine-grained data on the variations of the system
    • seems likely to cause its own interference by running alongside the target algorithm
  • A benchmark should ideally be very similar to the actual program we wish to run; however, this may not be possible:
    • we can't make any assumptions about the target algorithm because we want to be able to configure any possible algorithm
    • the only thing consistent in all cases is the configuration process itself, which we don't need to benchmark or stabilize
    • we don't know what instructions and operations the target will execute, and some machines are better at some things than others (CPU, disk, I/O), so we can't necessarily use a single benchmark score to scale our results
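
A minimal sketch of the CPU-info idea, assuming per-model scaling factors have already been derived from collected data; the factor values and model-name strings below are placeholders, not measured numbers.

  # Minimal sketch: look up the CPU model in /proc/cpuinfo and scale a measured
  # runtime by a per-model factor derived from previously collected data.
  def cpu_model():
      try:
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("model name"):
                      return line.split(":", 1)[1].strip()
      except OSError:
          pass
      return "unknown"

  # Placeholder scaling factors relative to a reference machine; the real keys
  # and values would come from the collected /proc/cpuinfo and benchmark data.
  SCALE = {
      "Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz": 1.00,
      "Intel(R) Xeon(R) CPU E5645 @ 2.40GHz": 1.15,
  }

  def normalized_runtime(measured_seconds):
      """Adjust a measured runtime so runs from different CPU types are comparable."""
      return measured_seconds * SCALE.get(cpu_model(), 1.0)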
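
A minimal sketch of the repeated-run and periodic-benchmark ideas: time a short command several times, compute the coefficient of variation (CoV) of the runtimes, and flag the instance if the variation is too high. The benchmark command and the 5% threshold are illustrative assumptions, not values from the experiments below.

  # Time a short command several times, compute the CoV of the runtimes, and
  # report whether the instance looks stable enough to keep using.
  import statistics
  import subprocess
  import time

  def timed_run(cmd):
      start = time.perf_counter()
      subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
      return time.perf_counter() - start

  def cov(xs):
      return statistics.stdev(xs) / statistics.mean(xs)

  def instance_is_stable(cmd, repeats=5, threshold=0.05):
      times = [timed_run(cmd) for _ in range(repeats)]
      return cov(times) <= threshold

  # Example (hypothetical benchmark binary); if the check fails, the instance
  # could be released or its runs down-weighted:
  # stable = instance_is_stable(["./short_benchmark"])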

Experiments

Unix Bench Experiment

This small experiment attempted to reproduce variability results reported in other papers. It used the UnixBench benchmark suite to assess the performance of an instance. UnixBench compiles scores from many individual benchmarks covering all aspects of a system and presents them as a single index score.
  • I tested four instance types: T1.micro, M1.small, M1.medium, M1.xlarge
  • M1.xlarge instances have 4 cores. UnixBench ran twice on these instances: once using 1 core and once using all 4
  • four instances of each type were launched to check for variation within an instance type
  • each instance ran the benchmark 4 times to test for variation over time
  • The entire process was repeated a second time with 16 new instances to gather additional data and to see whether results varied between days
The experiment results: UnixBenchResults.xls
  • Instances perform reliably over time, with little variation (CoV < 4%, computed as in the sketch after this list), except for micro instances
  • micro instances vary considerably (CoV ~78%), starting strong on the first run and dropping off on the runs after
  • the micro instance variation is consistent with Amazon's description: "Micro instances (t1.micro) provide a small amount of consistent CPU resources and allow you to increase CPU capacity in short bursts when additional cycles are available."
  • Variation between instances of the same type is also small (CoV < 6%). This contradicts what some papers claimed; however, the underlying CPUs show very little variety
  • CPUs were mostly Xeon E5-2650, with only a few Xeon E5645
  • the region/availability zone used must be fairly homogeneous
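
The two CoV figures above (per-instance variation over repeated runs, and variation across instances of a type) can be computed as in the following sketch; the index scores shown are made up, not taken from UnixBenchResults.xls.

  # scores[i][j] is the UnixBench index score of instance i on its j-th run
  # (made-up numbers for illustration only).
  import statistics

  def cov(xs):
      return statistics.stdev(xs) / statistics.mean(xs)

  scores = [
      [950, 940, 955, 945],   # instance 1: four runs over time
      [930, 935, 925, 940],   # instance 2
      [960, 950, 955, 965],   # instance 3
      [945, 940, 950, 935],   # instance 4
  ]

  # variation of each instance over its repeated runs (temporal)
  temporal_cov = [cov(runs) for runs in scores]
  # variation between instances of the same type (using each instance's mean score)
  inter_instance_cov = cov([statistics.mean(runs) for runs in scores])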

Spear Run Experiment

Another small experiment to test performance variability. This one focused on variation over short periods of time, repeatedly running many short Spear runs and checking the stability of their runtimes.
  • I tested four instance types: T1.micro, M1.small, M1.medium, M1.xlarge
  • four instances of each type were tested
  • each instance ran 80 problems from SW_verification/Wine through Spear
  • each problem was run 30 times consecutively to measure runtime variation
  • The median completion time for a problem was 1.7 s (wallclock), so the experiment measured variation at a small timescale
  • Both wallclock and CPU times were recorded
The experiment results: SpearResultsWallclock.xls, SpearResultsCPUTime.xls
  • CoVs were calculated for all 80 problems on each instance as a measure of the performance variation over the course of the 30 runs (aggregated per instance as in the sketch after this list)
  • wallclock and CPU time results were very similar
  • Wallclock CoV (avg, max):
    • t1.micro : 64.0%, 145.2%
    • m1.small : 2.7%, 15.4%
    • m1.medium : 1.0%, 11.7%
    • m1.xlarge : 0.9%, 4.1%
  • Micro instances are very unstable, which is to be expected
  • other instance types seem to be fairly stable over time
  • these runs only covered a few hours in a single region; a busier time or region might show worse effects
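
A small sketch of the aggregation behind the average and maximum CoV figures above: compute the CoV of each problem's repeated runs, then take the mean and maximum per instance. The problem-to-runtimes mapping and the names in the usage comment are hypothetical; the real data is in the attached .xls files.

  # Summarize one instance: CoV per problem over its repeated runs, then the
  # average and maximum CoV across all problems on that instance.
  import statistics

  def cov(xs):
      return statistics.stdev(xs) / statistics.mean(xs)

  def summarize_instance(runtimes):
      covs = [cov(times) for times in runtimes.values()]
      return statistics.mean(covs), max(covs)

  # Example (hypothetical problem names and runtimes):
  # avg_cov, max_cov = summarize_instance({"wine-001.cnf": [1.68, 1.71, 1.70], ...})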

Related Work

Closely Related

Runtime Measurements in the Cloud: Observing, Analysing, and Reducing Variance
  • Run a variety of benchmarks (CPU, Memory, Disk, Network) on a large number of small and large EC2 instances over the course of a month
  • Results are stratified corresponding to the two CPU types: Xeon and Opteron
  • Xeon-backed instances perform about twice as well as Opteron-backed instances
  • Distribution of processor types varies by availability zone
EC2 Performance Analysis for Resource Provisioning of Service-Oriented Applications
  • Over a long time (hours) individual instances are stable in performance
  • Over a short time (minutes), instances can have sharp dips in performance
  • Between instances of the same type, average performance can vary by a factor of 4
Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2
  • Measures the distribution of processor types, both Xeon and Opteron, broken down into specific models
  • Benchmarks the performance of each processor type
  • Checks CPU information located in '/proc/cpuinfo'; the VM hypervisor does not modify this
  • Outlines simple cost analysis of seeking out better performing instances
More for Your Money: Exploiting Performance Heterogeneity in Public Clouds
  • Examines EC2 variation on three levels: inter-architecture, intra-architecture, and temporal
  • measures significant variation in each case
  • proposes both black-box (measured performance) and grey-box (based on knowledge of processor distributions) placement methods
  • evaluates placement methods in simulations and on EC2

Somewhat Related

How is the Weather tomorrow? Towards a Benchmark for the Cloud
  • Discusses cloud services in general and what they offer
  • Lists important metrics to consider in cloud computing and how to go about writing a cloud-wide benchmark to measure these
Benchmarking in the Cloud: What it Should, Can, and Cannot Be
  • Provides a general overview of benchmarks and what makes a good one
  • Lists challenges to consider when developing a benchmark for the cloud; focuses on benchmarking entire cloud rather than individual nodes
A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing
  • Evaluates the EC2 cloud through a series of benchmarks to determine its suitability to scientific computing
  • Concluded that its performance and reliability were low compared to a dedicated cluster and thus less desirable
Resource Provisioning of Web Applications in Heterogeneous Clouds
  • Measures heterogeneity of EC2 instances in terms of CPU and memory performance and proposes provisioning instances to the tasks that suit them best
  • Focuses on web services and on maintaining the Service Level Objective
An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS
  • Evaluated EC2 and related services, focusing mainly on data transfers between EC2 and S3
  • S3 delivers up to 5 times better performance to EC2 than to outside locations, but its performance can vary significantly in all cases
Exploring the Performance Fluctuations of HPC Workloads on Clouds
  • Measures the variability in runtimes of running various solvers on EC2 and FutureGrid
  • Runtime fluctuation increases when more cores are used for a solver

-- Main.geschd - 05 Aug 2013
