Assignment 2

416 Distributed Systems: Assignment 2

Due: October 2nd at 11:59pm

Fall 2018

In this assignment, you will implement a restartable web cache. A web cache is a type of a proxy server that stores frequently accessed web content to reduce the latency observed by browser clients.

You will design, implement, deploy, and test a static web content cache on the Azure cloud. Your web cache must interoperate with the Firefox browser and it must be restartable (it must store cached content in memory and on disk).

Background

Internet users with similar interests often access and download the same content. A single user will also often access the same online resources repeatedly. Without a web cache, every time a user requests some content, the response must come from the server hosting the content. If many users are accessing the same content at the same time, the response time may increase and cause an overload on the server. A web cache handles requests for popular content that would otherwise be directed to the server, thereby preventing server overload and decreasing the response time for users. Here is a high-level diagram that illustrates how a web cache works in practice.

Cache Policies Web caches are highly configurable. Configurations are manifested in the form of policies, which dictate how the web cache should behave in different situations. Examples include which content to cache, which content to evict and when, and whether or not to share cached data between clients. Note that since multiple clients may be using the same web cache, a web cache may present security and privacy concern, e.g., they may be susceptible to timing attacks that leak information about accessed content.

Web cache policies can be configured in many different ways and usually depend on the content. In this assignment, we will tell you what policies your web cache should use via command-line arguments.

HTTP also has built in cache policy parameters, which you will learn about if you choose to attempt the extra credit EC3. The parameters allow clients and servers to have more fine-grained control over caches that attempt to cache specific web page content (of course, a web cache may decide to ignore these directives).

Cache Updates Additionally, web caches must make decisions about when to update data stored in the cache. There are many policies for handling updates. For example, the web cache may wait for the user to explicitly say when to update the data. In this assignment, we will have a notion of expiration to determine whether to update the content or not.

HTTP is the protocol used by the browser to request content from the web cache, and by web servers to respond with the content. There are two types of HTTP messages: requests and responses. Your web cache will handle both of them.

HTTP Requests are made by clients to request some service from the server. The two most common types of HTTP requests, GET and POST, are used to request data from/send data to the server.

HTTP Responses are how web servers respond to a client request, they contain a response code and sometimes the requested data (or an error message).

Overview

In this assignment, you will design a web cache that caches and serves static web content retrieved by a browser using HTTP GETs. Your web cache must be able to serve multiple clients concurrently and must have persistent state to recover from crashes or restarts. Your web cache does not need to support HTTPS or dynamic content.

High-level design

Browser. In this assignment we expect you to use the Firefox browser for development and for your demo. You will need to change the Firefox proxy HTTP settings to point to your web cache (the TCP IP:port of your active web cache instance).

Responding to Requests. The browser will send an HTTP request to the web cache for a URL. The web cache can ignore any requests other than GET requests. That is, the web cache must transparently proxy these requests to the appropriate destination (i.e., issue these requests on behalf of the browser).

The web cache must implement the following logic for GET requests. If the GET is for content that is in the cache (matching on the URL in the GET requests) and this content has not expired, then the web cache must return the cached content. Otherwise, it should do the following:

Parse the destination from the request, forward the request to the destination, and listen for a response. If there are any errors parsing or forwarding the request, return an HTTP response with an error code.
Upon receiving a response, check the response code: if the code is not 200, forward the response to the browser. Otherwise, parse the HTML content in the response, and request all resources linked in the page (explained below), storing all fetched resources in the cache along with the page.
Update the response content by changing resource links to point to resources in the cache.
Finally, return the updated response to the browser.

These steps are detailed further below.

HTTP Message formats HTTP Requests are formatted as follows:

The request line, which specifies the resource and action
Some number of header fields
An empty line ('\r\n')
An optional body

All lines end with '\r\n', and the message ends with two '\r\n's. Your web cache should handle HTTP requests by parsing the destination and forwarding the message to the destination. If you are unable to parse the message, or encounter an error forwarding the message, you should reply with an HTTP error response (e.g., HTTP 500).

HTTP responses are formatted as follows:

The status line, which includes the response code and message
Some number of header fields
An empty line ('\r\n')
An optional body

All lines end with '\r\n', and the message ends with two '\r\n's. If the response code is anything other than 200, you should forward the response back to the browser and have it deal with it. HTTP responses will contain the actual web page which you must cache.

HTML Parsing. To obtain the resources associated with an HTML page, the web cache must parse through the HTML page and find all the resources it needs to host to update the links correctly in the HTML page returned back to the client. The web cache should cache the resources with the page. You must only parse and obtain resources pointed to by attributes (src/href) of the following HTML tags:

Note that your implementation does not have to handle relative URLs.

Persistence. In the case of restarts or crashes, the web cache must be able to build the cache completely from disk once it is started up again. This requires that the cached content is written to disk. Your cache must guarantee that any content that is cached in memory is also cached on disk. For example, the updated pages must be written to the disk before they are sent back to the user as responses.

Cache Eviction and Replacement. If the cache is full and a new item must be added, a cached item must be chosen to be replaced according to the cache replacement policy. Your web cache must have support for the following cache replacement policies, which will be specified as command line arguments:

Note that your cache must consistently implement these policies for the in-memory cache and the on-disk cache (they must be mirrors of each other).

Item expiration. If an item in the cache expires (has been originally fetched from the source longer than expiration_time ago), then the web cache should delete the item from the cache. The client should never observe expired content. If the client issues a GET for content that expired, the web cache should re-fetch the content anew.

Here is what the protocol will look like in terms of flow of messages:

Note that in the above diagram index.html does not have any other linked resources to be cached. Here is how the flow differ if, for example, index.html had an image. In this case the cache would (1) retrieve the image after receiving index.html from www (parsing index.html to find the linked image resource), (2) cache the image in memory and on disk, (3) update/rewrite the image URL in the index.html to point to the image in the cache, and (4) only then deliver index.html back to the Client. After this point, the Client would fetch the image from the web cache (and not www). Here is an illustration of these steps:

The above diagram has a single sync after all the resources are fetched and the html is rewritten. This is one way to sync to the disk: after all of the resources are fetched. There are alternative designs. For example, you could sync to disk after retrieving each individual resource.

There be dragons. This assignment is easy to describe, but under the surface there are many challenges that you will have to navigate with your team. These challenges are deliberately under- or completely un-specified in this assignment. This is because we want you to solve them on your own, in your own unique way. Make sure that you start early and that you think through your design in high detail before embarking on an implementation.

Your web cache must somehow represent cached content in memory. You should think about an appropriate set of data structures for this.
Your web cache must store cached content on disk. The simplest solution is to store content as files, but you should think about how you will name these files, what directory structure you will use, etc. You can also use an external data store (database? key-value store?). Just make sure that it runs on a single node (the same one as the web cache).
Your web cache must support multiple concurrent clients. A1 prepared you for this. You will need to use go routines (perhaps one per TCP connection), channels to coordinate them, and go routines to coordinate disk access.
Your web cache must rewrite URLs to content that it cached (e.g., the image example just above). You must determine a URL schema for cached content and a naming convention for cached items (SHA-256 to the rescue?).
There are numerous concurrency conditions that you should think through. For example, a browser may access index.html and receive an updated version with re-written links. Later (in one hour) the user refreshes the page, but your cache no longer holds the content (it expired). In this case the cache must still (correctly) deliver the page to the user.
Your cache may restart at any time (we will do this in the demo). It is critical that your state on disk is consistent (what happens if you only wrote half of index.html before restarting?). Your cache cannot ever enter a corrupted state and it cannot deliver wrong/corrupted content to the client.

Suggested Approach

We suggest the following step-by-step approach for successfully completing the assignment:

Step 1: Implement a web cache that just forwards the content for the requests (a simple proxy).
Step 2: Modify your web cache to cache/store all the HTML content in memory, without regard for policy.
Step 3: Implement HTML parsing to parse, obtain, store, and host all the other static content.
Step 4: Implement cache eviction policies.
Step 5: Store the cached content on disk and implement code to rebuild the cache during restarts.

It is a good idea to abstract out the implementation of policy-dependent behavior so it can easily be changed depending on the specific policy.

Azure Deployment

The Azure cloud is composed of several data-centers, located world-wide. Each data center hosts many thousands of machines that can be used (i.e., rented) for a price. In this assignment you may use the Azure cloud to deploy and test your solution in the wide area. That is, you will deploy your web cache in a VM on Azure. The browser client does not need to be deployed on Azure (just run it on your personal machine).

Although you will test and deploy your system on Azure, it will have nothing Azure-specific about it, e.g., it should be possible to run your system on the CS ugrad servers without any code modifications (though we will not test this).

Using Azure: stop VMs when not using

We prepared a google slides presentation covering the basic workflow of getting a VM running on Azure for this/future assignments. To setup the Go environment in a VM you can use the azureinstall.sh script.

The default Azure subscription comes with a limitation of 20 cores per region. For this assignment you should not need cores above this limit.

Use this site to check your account balance.

Access information will be posted to piazza.

A key detail is that each second that your VM is running it is draining your balance (yikes!). You should STOP your VMs when you are not using them. It's up to you to police your own account.

Implementation Requirements

The web cache code must be runnable on Azure Ubuntu machines configured with Go 1.9.7
Your solution can only use standard library Go packages except for the HTML library mentioned below and for (optionally) external data store for on-disk storage. Talk to us if you want to use other dependencies.
We highly recommend that you use the HTML library for parsing HTML in your web cache.
Your solution code must be Gofmt'd using gofmt

Solution Spec

Write a go program called webcache.go that behaves according to the description above.

WebCache's recommended command line usage:

go run web-cache.go [ip1:port1] [ip2:port2] [replacement_policy] [cache_size] [expiration_time]

[ip1:port1] : The TCP IP address and the port that the web cache will bind to to accept connections from clients. The web cache should also bind to ip1 when connecting to remote web servers to retrieve resources on behalf of clients.
[ip2:port2] : The TCP IP address and the port that the web cache should use when rewriting the HTML. For example, the web cache would rewrite <img src="http://foo.com/image.jpg"/> to <img src="http://ip2:port2/URL"/> (using whatever scheme for URL that you have implemented). In your demo, port1 will be equal to port2.
[replacement_policy] : The replacement policy ("LRU" or "LFU") that the web cache follows during eviction.
[cache_size] : The capacity of the cache in MB (your cache cannot use more than this amount of capacity). Note that this specifies the (same) capacity for both the memory cache and the disk cache.
[expiration_time] : The time period in seconds after which an item in the cache is considered to be expired.

The above arguments are required. You may include other arguments, but please check with us first. You may also use a config file or some other way of passing arguments to your web cache, and you may decide to start your web cache using a different variant of the 'go run' command.

Grading Scheme

We will follow a Demo style grading scheme. At the high level your mark for this assignment is 15% of your final mark and has these components:

Code: 10%
Demo: 10%

Note that the demo actually exercises both the Code and the Demo portion of your mark. So, the best way of thinking about the demo rubric below is that it is 100% of your mark. Of course, we reserve the right to look at your code on our own if we want to to further convince ourselves about some functional aspect or check that you do indeed implement some particular piece (e.g., HTML parsing).

The demo has 3 parts:

Part A (40%) : You demo for us the correctness for web cache with a web server of your choice, and one of our web servers.
Part B (40%) : You demo for us how your web cache handles recovery after restarts with the same set of servers.
Part C (20%) : We ask you design-level and code-level questions about your system

Each demo will have 2 instructors -- either two TAs or Ivan and one of the TAs. One of us will be taking notes, and the second person will be laser focused on your demo -- what you are doing, how you are doing it, clarifying what's going on, etc.

Each team has a guaranteed slot of 20 minutes. Each part of the demo is expected to take about 6.66 minutes. Note that 20 minutes may be a hard cut off, especially in cases where there is another team scheduled after your team. (We give ourselves about 5-10 minutes to deliberate your mark and review the demo after you leave the room. The more time we have for this, the more likely that you will get a fair shake).

Note that your demo will be run in our environment. That is, when you enter the demo room, we will have an Azure VM up and running and ready for your demo. The VM will have:

Go version 1.9.7 installed
Emacs/vi/vim installed
Your code cloned/checked out from stash (last version before deadline)

This means that if you need to install any additional packages on the VM then you will need to (1) write a script to install them, and (2) the installation/setup time will count as part of your 20 minutes. Please talk to us before the demo if you plan to install any dependencies.

Your demo in part (A) must obey the following parameters:

Part B (recovery):

Part C (design Q/A):

We will ask you questions from a list of questions that we have curated for this part.
We will try to make sure there is a whiteboard available for you to draw diagrams if you need one.
Each person on the team should attempt to answer the question on their own first. If the person is unable to do this, then the other person can jump in and help answer the question.
The questions are high-level design questions about your system. They are the kinds of questions that everyone on the team should know the answer (at least to some degree!).
These questions may escalate into discussion about your design choices more broadly.

Nice to have, or strongly recommended for the demo:

Strongly recommend including (comprehensible) stdout/logging in your codebase so that we can track things as they occur on the terminal. If you lack this, then we can only go by the output in the browser, and if that doesn't work, then it will be hard (impossible) for us to justify any partial marks.
A demo plan. We will have specific things we will want you to demo for us (with our server). But, how you decide to demo the functionality of your web cache with your set of web servers is up to you. Be prepared!
Make sure to practice the demo and select pages that you will demo for us. You may want to use a private web server that serves pages that are especially well suited to your demo.

What to bring for the demo:

You'll need your own laptop to drive the demo. I recommend bringing two so that you can coordinate with your partner, and you have some redundancy + 4 hands can type faster than 2 hands.
Team spirit.

Extra Credit

This project is extensible with the following extra credits:

EC1 [1% of final mark] Implement a policy called ELEPHANT in which cached content is never deleted (from disk) and rotates in and out of memory. That is, use the size parameter for the memory cache, but assume that the disk has unlimited capacity. Storage is cheap :) Updated the web cache logic as follows:
- If an item is not in the memory cache, then the web cache should check if the item is stored on the disk. If it is on disk, the web cache should bring the item into memory (possibly evicting another item using the LRU policy), and return the item to the client.
- When evicting an item from the cache, the item must be stored to the disk instead of just being thrown away.
Remember that your web cache should still be persistent and should still be able to rebuild after a restart or a crash.
EC2 [1% of final mark] Modify the web-cache to be consistent with the cache-control directives in the HTTP header field.
- More Information on cache-control directives in the header.
- You only need to support the following cache-control directives:
  - public : Indicates that the response may be cached by any cache.
  - no-store : The cache should not store anything about the client request or server response.
EC3 [1% of final mark] Write a JS script that determines whether or not a resource on the same page is being cached or not. That is, your JS code should detect the presence of a web cache. Your JS code should be as independent of your web cache design/implementation as possible. For example, the same piece of code should be able to detect a web cache implemented by a different team in the course.

Make sure to follow the course collaboration policy and refer to the submission instructions that detail how to submit your solution.