Benchmarking for AC on the Cloud

Goals

We wish to perform algorithm configuration in a cloud setting. To do this effectively, we need to be able to deal with two issues inherent to cloud computing: the heterogeneity and variability of virtual machines. We want to be able to:
  • determine when a run is unreliable due to VM fluctuations
  • determine when an instance is no longer reliable as a whole so we can release it
  • determine how instances differ so as to properly weight and compare runs from different instances
  • perform all measurements and diagnostics with as little overhead as possible
Overall, we hope to:
  • devise a method to run algorithm configuration in the cloud to obtain results that are consistent with what we would obtain from running on a local cluster

Ideas

  • Look at the instance's CPU info and make an adjustment based on pre-existing/collected data for that CPU type (see the /proc/cpuinfo sketch after this list)
    • seems likely to provide some benefit; measurements from past research indicate that CPU type corresponds closely to performance
    • may or may not account for all inter-instance heterogeneity, but will not account for temporal variation
  • Run short algorithm runs multiple times to build confidence in the result (see the CoV-check sketch after this list)
    • one paper found that temporal variation was not significant over long periods of time, but in the short term it could be more of a factor
    • the shorter the run, the more redundancy we use
  • Run a large system benchmark (UnixBench?) at the launch of the instance to assess its performance
    • may or may not provide an accurate assessment of how the instance will perform on the target algorithm
    • if instances are consistent throughout their lifetime, one accurate measure at launch could be sufficient
  • Run multiple small benchmarks throughout the lifetime of an instance (the same CoV-check sketch applies here)
    • if instances are inconsistent throughout their lifetimes, this method could adjust as it goes
    • could detect if variations are too high for our purposes and terminate an instance
  • Constantly run some form of monitoring program in the background alongside the target algorithm runs
    • could provide very fine-grained data on the variations of the system
    • seems likely to cause its own interference by running alongside the target algorithm
  • A benchmark should ideally be very similar to the actual program we wish to run; however, this may not be possible:
    • we can't make any assumptions about the target algorithm because we want to be able to configure any possible algorithm
    • the only thing consistent in all cases is the configuration process itself, which we don't need to benchmark or stabilize
    • we don't know what instructions and operations the target will execute, and some machines are better at some things than others (CPU, disk, I/O), so we can't necessarily use a single benchmark score to scale our results
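
A minimal sketch of the CPU-info idea, assuming per-model scaling factors have already been derived from collected data; the factor values and model-name strings below are placeholders, not measured numbers.

  # Minimal sketch: look up the CPU model in /proc/cpuinfo and scale a measured
  # runtime by a per-model factor derived from previously collected data.
  def cpu_model():
      try:
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("model name"):
                      return line.split(":", 1)[1].strip()
      except OSError:
          pass
      return "unknown"

  # Placeholder scaling factors relative to a reference machine; the real keys
  # and values would come from the collected /proc/cpuinfo and benchmark data.
  SCALE = {
      "Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz": 1.00,
      "Intel(R) Xeon(R) CPU E5645 @ 2.40GHz": 1.15,
  }

  def normalized_runtime(measured_seconds):
      """Adjust a measured runtime so runs from different CPU types are comparable."""
      return measured_seconds * SCALE.get(cpu_model(), 1.0)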
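
A minimal sketch of the repeated-run and periodic-benchmark ideas: time a short command several times, compute the coefficient of variation (CoV) of the runtimes, and flag the instance if the variation is too high. The benchmark command and the 5% threshold are illustrative assumptions, not values from the experiments below.

  # Time a short command several times, compute the CoV of the runtimes, and
  # report whether the instance looks stable enough to keep using.
  import statistics
  import subprocess
  import time

  def timed_run(cmd):
      start = time.perf_counter()
      subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
      return time.perf_counter() - start

  def cov(xs):
      return statistics.stdev(xs) / statistics.mean(xs)

  def instance_is_stable(cmd, repeats=5, threshold=0.05):
      times = [timed_run(cmd) for _ in range(repeats)]
      return cov(times) <= threshold

  # Example (hypothetical benchmark binary); if the check fails, the instance
  # could be released or its runs down-weighted:
  # stable = instance_is_stable(["./short_benchmark"])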

Experiments

Unix Bench Experiment

This small experiment attempted to reproduce variability results reported in other papers. It used the UnixBench benchmark suite to assess the performance of an instance. UnixBench compiles scores from many individual benchmarks covering all aspects of a system and presents them as a single index score.
  • I tested four instance types: T1.micro, M1.small, M1.medium, M1.xlarge
  • M1.xlarge instances have 4 cores. UnixBench ran twice on these instances: once using 1 core and once using all 4
  • four instances of each type were launched to check for variation within an instance type
  • each instance ran the benchmark 4 times to test for variation over time
  • The entire process was repeated a second time with 16 new instances to gather additional data and to see whether results varied between days
The experiment results: UnixBenchResults.xls
  • Instances perform reliably over time, with little variation (CoV < 4%, computed as in the sketch after this list), except for micro instances
  • micro instances vary considerably (CoV ~78%), starting strong on the first run and dropping off on the runs after
  • the micro instance variation is consistent with Amazon's description: "Micro instances (t1.micro) provide a small amount of consistent CPU resources and allow you to increase CPU capacity in short bursts when additional cycles are available."
  • Variation between instances of the same type is also small (CoV < 6%). This contradicts what some papers claimed; however, the underlying CPUs show very little variety
  • CPUs were mostly Xeon E5-2650, with only a few Xeon E5645
  • the region/availability zone used must be fairly homogeneous
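
The two CoV figures above (per-instance variation over repeated runs, and variation across instances of a type) can be computed as in the following sketch; the index scores shown are made up, not taken from UnixBenchResults.xls.

  # scores[i][j] is the UnixBench index score of instance i on its j-th run
  # (made-up numbers for illustration only).
  import statistics

  def cov(xs):
      return statistics.stdev(xs) / statistics.mean(xs)

  scores = [
      [950, 940, 955, 945],   # instance 1: four runs over time
      [930, 935, 925, 940],   # instance 2
      [960, 950, 955, 965],   # instance 3
      [945, 940, 950, 935],   # instance 4
  ]

  # variation of each instance over its repeated runs (temporal)
  temporal_cov = [cov(runs) for runs in scores]
  # variation between instances of the same type (using each instance's mean score)
  inter_instance_cov = cov([statistics.mean(runs) for runs in scores])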

Spear Run Experiment

Another small experiment to test performance variability. This one focused on variation over short periods of time, repeatedly running many short Spear runs and checking the stability of their runtimes.
  • I tested four instance types: T1.micro, M1.small, M1.medium, M1.xlarge
  • four instances of each type were tested
  • each instance ran 80 problems from SW_verification/Wine through Spear
  • each problem was run 30 times consecutively to measure runtime variation
  • The median completion time for a problem was 1.7 s (wallclock), so the experiment measured variation at a small timescale
  • Both wallclock and CPU times were recorded
The experiment results: SpearResultsWallclock.xls, SpearResultsCPUTime.xls
  • CoVs were calculated for all 80 problems on each instance as a measure of the performance variation over the course of the 30 runs (aggregated per instance as in the sketch after this list)
  • wallclock and CPU time results were very similar
  • Wallclock CoV (avg, max):
    • t1.micro : 64.0%, 145.2%
    • m1.small : 2.7%, 15.4%
    • m1.medium : 1.0%, 11.7%
    • m1.xlarge : 0.9%, 4.1%
  • Micro instances are very unstable, which is to be expected
  • other instance types seem to be fairly stable over time
  • these runs only covered a few hours in a single region; a busier time or region might show worse effects
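
A small sketch of the aggregation behind the average and maximum CoV figures above: compute the CoV of each problem's repeated runs, then take the mean and maximum per instance. The problem-to-runtimes mapping and the names in the usage comment are hypothetical; the real data is in the attached .xls files.

  # Summarize one instance: CoV per problem over its repeated runs, then the
  # average and maximum CoV across all problems on that instance.
  import statistics

  def cov(xs):
      return statistics.stdev(xs) / statistics.mean(xs)

  def summarize_instance(runtimes):
      covs = [cov(times) for times in runtimes.values()]
      return statistics.mean(covs), max(covs)

  # Example (hypothetical problem names and runtimes):
  # avg_cov, max_cov = summarize_instance({"wine-001.cnf": [1.68, 1.71, 1.70], ...})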

Related Work

Closely Related

Runtime Measurements in the Cloud: Observing, Analysing, and Reducing Variance
  • Run a variety of benchmarks (CPU, Memory, Disk, Network) on a large number of small and large EC2 instances over the course of a month
  • Results are stratified corresponding to the two CPU types: Xeon and Opteron
  • Xeon-backed instances perform about twice as well as Opteron-backed instances
  • Distribution of processor types varies by availability zone
EC2 Performance Analysis for Resource Provisioning of Service-Oriented Applications
  • Over a long time (hours) individual instances are stable in performance
  • Over a short time (minutes), instances can have sharp dips in performance
  • Between instances of the same type, average performance can vary by a factor of 4
Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2
  • Measures the distribution of processor types, both Xeon and Opteron, broken down into specific models
  • Benchmarks the performance of each processor type
  • Checks CPU information located in '/proc/cpuinfo'; the VM hypervisor does not modify this
  • Outlines simple cost analysis of seeking out better performing instances
More for Your Money: Exploiting Performance Heterogeneity in Public Clouds
  • Examines EC2 variation on three levels: inter-architecture, intra-architecture, and temporal
  • measures significant variation in each case
  • proposes both black-box (measured performance) and grey-box (based on knowledge of processor distributions) placement methods
  • evaluates placement methods in simulations and on EC2

Somewhat Related

How is the Weather tomorrow? Towards a Benchmark for the Cloud
  • Discusses cloud services in general and what they offer
  • Lists important metrics to consider in cloud computing and how to go about writing a cloud-wide benchmark to measure these
Benchmarking in the Cloud: What it Should, Can, and Cannot Be
  • Provides a general overview of benchmarks and what makes a good one
  • Lists challenges to consider when developing a benchmark for the cloud; focuses on benchmarking entire cloud rather than individual nodes
A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing
  • Evaluates the EC2 cloud through a series of benchmarks to determine its suitability to scientific computing
  • Concluded that its performance and reliability were low compared to a dedicated cluster and thus less desirable
Resource Provisioning of Web Applications in Heterogeneous Clouds
  • Measures heterogeneity of EC2 instances in terms of CPU and memory performance and proposes provisioning instances to the tasks that suit them best
  • Focuses on web services and on maintaining the Service Level Objective
An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS
  • Evaluated EC2 and related services, focusing mainly on data transfers between EC2 and S3
  • S3 delivers up to 5 times better performance to EC2 than to outside locations, but its performance can vary significantly in all cases
Exploring the Performance Fluctuations of HPC Workloads on Clouds
  • Measures the variability in runtimes of running various solvers on EC2 and FutureGrid
  • Runtime fluctuation increases when more cores are used for a solver

-- Main.geschd - 05 Aug 2013
