Assignment 4

416 Distributed Systems: Assignment 4

Due: Feb 27 at 9PM

2016W2, Winter 2017

In this assignment you will use the Azure cloud to build a global distributed measurement platform to evaluate website performance and to detect regional content variation.

High-level description

The Azure cloud is composed of several data centers located worldwide. Each data center hosts many thousands of machines that can be rented for a price. In this assignment you will build a distributed measurement system and test it by deploying it on the Azure cloud.

The purpose of your system is to use the Azure data centers as different vantage points from which you can measure website performance and regional variation in website content. For example, your system will measure the latency to fetch the www.facebook.com page from around the world. It will also reveal whether the page content retrieved from these different locations varies, that is, whether there is regional specialization of website content. Although you will test and deploy your system on Azure, it will have nothing Azure-specific about it, e.g., it should be possible to run your system on the CS ugrad servers without any code modifications.

Your system will use a server process to coordinate several worker processes that run across the Azure cloud in different data centers. The server will receive job submissions from clients through an RPC-based API, where a job is one specific website target. To service a client's job, the server must coordinate with worker processes, and then report the measurement results back to the client.

Using Azure: stop VMs when not using

We prepared a Google Slides presentation covering the basic workflow of getting a VM running on Azure for this and future assignments. To set up the Go environment in a VM you can use the azureinstall.sh script.

The default Azure subscription comes with a limitation of 20 cores per region. For this assignment you should not need cores above this limit. Though you can issue

A Piazza post covered some basic details of Azure. The most important of these is that your VM drains your balance every second that it is running (yikes!). You should STOP your VMs when you are not using them. It's up to you to police your own account.

The details

As in the previous two assignments we provide interfaces that your system must use to communicate with the outside world (RPC to the server and HTTP protocol to websites). Internal communication and coordination between the server and the worker processes is up to you to design and implement.

The server must listen for connections from one client as well as several workers. The server must be able to handle an arbitrary number of workers. The server must handle exactly one client that connects and invokes several RPCs. The client can invoke two kinds of RPCs against the server:

  • website-stats ← MeasureWebsite(URI, samplesPerWorker)
    • Instructs the system's workers to perform distributed measurements of URI. Each worker should perform samplesPerWorker measurements. Returns a data structure containing the IP of each worker and the min/median/max latency for that worker to retrieve URI. See client code below for more details.
  • worker-stats ← GetWorkers(samplesPerWorker)
    • Instructs the system to perform distributed measurements from each of the workers to the server. Each worker should perform samplesPerWorker measurements. Returns a data structure containing the IP of each worker and the min/median/max round-trip latency between that worker and the server. See client code below for more details.

Latency to download the page from a worker: The MeasureWebsite RPC requires measuring latency between a website and a worker. A single latency measurement should be the total time (in milliseconds) that it takes the worker to retrieve the resource at the specified URI. This time will vary from one request to another. Therefore, the client specifies samplesPerWorker as part of the RPC, which is the number of times a worker should perform the measurement. The per-worker statistics reported back to the client should be the min/median/max of the set of performed measurements.
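The measurement and summary steps described above can be sketched as follows. This is a minimal illustration, not the required design: the LatencyStats type and the fetchLatencyMs and summarize helpers are hypothetical names (the real RPC data structures are defined in the provided client code), and taking the mean of the two middle samples as the median for an even sample count is an assumption, since the assignment does not say how to break that tie. The sketch uses io/ioutil so that it also builds with Go 1.7.4.

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"sort"
)

import "time"

// LatencyStats is a hypothetical container for one worker's results,
// with all values in milliseconds.
type LatencyStats struct {
	Min, Median, Max int
}

// fetchLatencyMs times one complete retrieval of uri, including reading
// the whole response body, and returns the elapsed time in milliseconds.
func fetchLatencyMs(uri string) (int, error) {
	start := time.Now()
	resp, err := http.Get(uri)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// Drain the body so the timing covers the full download.
	if _, err := io.Copy(ioutil.Discard, resp.Body); err != nil {
		return 0, err
	}
	return int(time.Since(start) / time.Millisecond), nil
}

// summarize returns the min/median/max of samples. It sorts a copy so the
// caller's slice is left untouched. Assumption: for an even number of
// samples the median is the integer mean of the two middle values.
func summarize(samples []int) LatencyStats {
	if len(samples) == 0 {
		return LatencyStats{}
	}
	s := make([]int, len(samples))
	copy(s, samples)
	sort.Ints(s)
	n := len(s)
	median := s[n/2]
	if n%2 == 0 {
		median = (s[n/2-1] + s[n/2]) / 2
	}
	return LatencyStats{Min: s[0], Median: median, Max: s[n-1]}
}

func main() {
	// Made-up samples, in milliseconds, standing in for real measurements.
	fmt.Println(summarize([]int{120, 80, 95, 300, 110})) // prints {80 110 300}
}
```

A worker would call fetchLatencyMs samplesPerWorker times and pass the collected values to summarize. Note that http.Get uses Go's default client, which follows redirects and may reuse connections between requests; whether that matches the intended definition of a single measurement is a design decision left to you.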

RTT latency between worker and server: The GetWorkers RPC requires measuring latency between a worker and the server. This should be the round-trip time (in milliseconds) to ping-pong a message via UDP. As with the page-download latency, the client specifies samplesPerWorker as part of the RPC, and the per-worker statistics reported back to the client should be the min/median/max of the set of performed measurements.

You can test and debug your solution using the sample client code that we provide to you.

Assumptions you can make

  • The server will start at least 3s before any worker.
  • All workers will start at least 3s before a client will issue a request.
  • Assume that round-trip time latency between any worker and the server is less than 1s.
  • The server and workers do not fail and are always reachable.
  • The client will wait indefinitely to receive a response from the server to either of the two RPC invocations.
  • The website URI specified by the client is reachable by all workers and speaks HTTP over TCP port 80.

Assumptions you cannot make

  • You cannot assume the number of workers or where the worker or the server are located.
  • You cannot assume anything about the target website.

Implementation requirements

  • The client code must be runnable on Azure Ubuntu machines configured with Go 1.7.4 (see the linked azureinstall.sh script and the Google Slides presentation for more info).
  • Your solution can only use standard library Go packages.
  • Your solution code must be formatted using gofmt.

Extra credit

The assignment is extensible with two kinds of extra credit. You must create an EXTRACREDIT.txt file in your repository and specify in this file which extra credit features you have implemented.

  • EC1 (1% of final grade): Determine whether there are regional variations in the website content. Your implementation must perform differencing of content retrieved by each pair of workers (pairwise diff). See the MRes.Diff map in the provided client code. The map value should be true for (worker1, worker2) when there is a difference in the retrieved content between worker1 and worker2; otherwise the map value should be false.
  • EC2 (3% of final grade): Add support for multiple clients that can concurrently connect to the server and then concurrently invoke RPCs against the server. The individual clients should (1) not be aware of other clients' activity, and (2) not experience a substantial RPC slow-down. Specifically, the server must be able to scale to an RPC throughput of at least 1000 client requests per second while imposing less than 10% latency slowdown on individual RPC requests (as compared to a single client).
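For EC1, the pairwise comparison could look roughly like the sketch below. The pairwiseDiff helper, the nested-map result shape, and the worker IPs are hypothetical; the actual type of MRes.Diff is defined in the provided client code and may differ. The sketch assumes that "a difference in the retrieved content" means the response bodies are not byte-for-byte identical.

```go
package main

import (
	"bytes"
	"fmt"
)

// pairwiseDiff takes the page body retrieved by each worker, keyed by
// worker IP, and reports for every ordered pair (w1, w2) whether the two
// bodies differ. A worker compared with itself is always false.
func pairwiseDiff(pages map[string][]byte) map[string]map[string]bool {
	diff := make(map[string]map[string]bool)
	for w1, p1 := range pages {
		diff[w1] = make(map[string]bool)
		for w2, p2 := range pages {
			diff[w1][w2] = !bytes.Equal(p1, p2)
		}
	}
	return diff
}

func main() {
	// Made-up worker IPs and page bodies for illustration.
	pages := map[string][]byte{
		"10.0.0.1": []byte("<html>hello</html>"),
		"10.0.0.2": []byte("<html>hello</html>"),
		"10.0.0.3": []byte("<html>bonjour</html>"),
	}
	diff := pairwiseDiff(pages)
	fmt.Println(diff["10.0.0.1"]["10.0.0.2"]) // prints false
	fmt.Println(diff["10.0.0.1"]["10.0.0.3"]) // prints true
}
```

Shipping whole page bodies to the server is the simplest option; having each worker send a hash of the body instead would reduce traffic. Keep in mind that pages with per-request dynamic content can differ between any two fetches, regardless of region.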

Solution spec

Write two Go programs called server.go and worker.go that behave according to the description above.

Server process command line usage:

go run server.go [worker-incoming ip:port] [client-incoming ip:port]

  • [worker-incoming ip:port] : the IP:port address that workers use to connect to the server
  • [client-incoming ip:port] : the IP:port address that clients use to connect to the server

Worker process command line usage:

go run worker.go [server ip:port]

  • [server ip:port] : the address and port of the server (its worker-incoming ip:port).

Client code

Download the client code. Please carefully read and follow the RPC data structure comments at the top of the file.

Rough grading rubric

  • 60%: MeasureWebsite RPC works as expected
  • 40%: GetWorkers RPC works as expected

Make sure to follow the course collaboration policy and refer to the assignment instructions that detail how to submit your solution.