416 Distributed Systems: Assignment 2
Due: October 2nd at 11:59pm
In this assignment, you will implement a restartable web
cache. A web cache is a type of
server that stores frequently accessed web content to
reduce the latency observed by browser clients.
Internet users with similar interests often access and
download the same content. A single user will also often
access the same online resources repeatedly. Without a web
cache, every time a user requests some content, the response
must come from the server hosting the content. If many users
access the same content at the same time, the server may become
overloaded and response times may increase.
A web cache handles requests for popular content that would
otherwise be directed to the server, thereby preventing server
overload and decreasing the response time for users. Here is a
high-level diagram that illustrates how a web cache works in
Cache Policies. Web caches are highly configurable. Configurations take the form of policies, which dictate how the web cache should behave in different situations. Examples include which content to cache, which content to evict and when, and whether or not to share cached data between clients. Note that since multiple clients may be using the same web cache, a web cache may present security and privacy concerns, e.g., it may be susceptible to timing attacks that leak information about accessed content.
Web cache policies can be configured in many different ways and usually depend on the content. In this assignment, we will tell you what policies your web cache should use via command-line arguments.
HTTP also has built-in cache policy parameters, which you will learn about if you choose to attempt the extra credit EC3. The parameters allow clients and servers to have more fine-grained control over caches that attempt to cache specific web page content (of course, a web cache may decide to ignore these directives).
Cache Updates. Additionally, web caches must make decisions about when to update data stored in the cache. There are many policies for handling updates. For example, the web cache may wait for the user to explicitly say when to update the data. In this assignment, we will use a notion of expiration to determine whether or not to update the content.
HTTP is the protocol used by the browser to request content from the web cache, and by web servers to respond with the content. There are two types of HTTP messages: requests and responses. Your web cache will handle both of them.
HTTP Requests are made by clients to request some service from the server. The two most common types of HTTP requests, GET and POST, are used to request data from and send data to the server, respectively.
HTTP Responses are how web servers respond to a client request; they contain a response code and sometimes the requested data (or an error message).
In this assignment, you will design a web cache that caches and serves static web content retrieved by a browser using HTTP GETs. Your web cache must be able to serve multiple clients concurrently and must have persistent state to recover from crashes or restarts. Your web cache does not need to support HTTPS or dynamic content.
Browser. In this assignment we expect you to use the Firefox browser for development and for your demo. You will need to change the Firefox HTTP proxy settings to point to your web cache (the TCP IP:port of your active web cache instance).
Responding to Requests. The browser will send an HTTP request to the web cache for a URL. The web cache does not need to cache or otherwise process any requests other than GET requests. Instead, the web cache must transparently proxy these non-GET requests to the appropriate destination (i.e., issue these requests on behalf of the browser).
The web cache must implement the following logic for GET requests. If the GET is for content that is in the cache (matching on the URL in the GET request) and this content has not expired, then the web cache must return the cached content. Otherwise, it should do the following:
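As a rough illustration of transparent proxying, here is a minimal sketch in Go. It assumes the browser sends the absolute URL in the request line (as browsers do when an HTTP proxy is configured). The names forward, handler, fakeOrigin, and demo are made up for this example, and fakeOrigin stands in for a real web server so that the example needs no network. Hop-by-hop header handling is deliberately left out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// forward relays a request to its origin via rt and copies the reply back.
func forward(rt http.RoundTripper, w http.ResponseWriter, r *http.Request) {
	// With a proxy configured, r.URL is absolute and names the origin server.
	out, err := http.NewRequest(r.Method, r.URL.String(), r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	out.Header = r.Header.Clone()
	out.ContentLength = r.ContentLength
	resp, err := rt.RoundTrip(out)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	for k, vs := range resp.Header {
		for _, v := range vs {
			w.Header().Add(k, v)
		}
	}
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

// handler proxies non-GET requests untouched. GET requests would go through
// the cache lookup, which this sketch omits.
func handler(rt http.RoundTripper) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			forward(rt, w, r) // not cached: pass through
			return
		}
		// GET: the cache lookup would run here; this sketch forwards as well.
		forward(rt, w, r)
	}
}

// fakeOrigin is a hypothetical stand-in for a web server. It echoes the
// method, path, and body of whatever request it receives.
type fakeOrigin struct{}

func (fakeOrigin) RoundTrip(r *http.Request) (*http.Response, error) {
	body, _ := io.ReadAll(r.Body)
	rec := httptest.NewRecorder()
	rec.WriteHeader(http.StatusCreated)
	fmt.Fprintf(rec, "%s %s %s", r.Method, r.URL.Path, body)
	return rec.Result(), nil
}

// demo pushes one POST through the handler and returns what the client sees.
func demo() (int, string) {
	req := httptest.NewRequest(http.MethodPost, "http://example.com/form", strings.NewReader("a=1"))
	rec := httptest.NewRecorder()
	handler(fakeOrigin{})(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := demo()
	fmt.Println(code, body)
}
```

In a real deployment, http.DefaultTransport could be passed as the RoundTripper in place of fakeOrigin.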
HTTP Message formats. HTTP requests are formatted as follows:
HTTP responses are formatted as follows:
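In general, an HTTP/1.1 message consists of a start line (a request line or a status line), zero or more header lines, a blank line, and an optional body, with lines ending in CRLF. The Go sketch below parses one example of each with the standard net/http package. The example.com URL, the headers, and the body are illustrative assumptions, as are the helper names parseRequest and parseResponse.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// rawRequest is a GET as a browser might send it to a proxy: request line,
// headers, then a blank line.
const rawRequest = "GET http://example.com/index.html HTTP/1.1\r\n" +
	"Host: example.com\r\n" +
	"Accept: text/html\r\n" +
	"\r\n"

// rawResponse is a reply: status line, headers, a blank line, then the body.
const rawResponse = "HTTP/1.1 200 OK\r\n" +
	"Content-Type: text/html\r\n" +
	"Content-Length: 13\r\n" +
	"\r\n" +
	"<html></html>"

// parseRequest returns the method and URL found in a raw HTTP request.
func parseRequest(raw string) (string, string, error) {
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(raw)))
	if err != nil {
		return "", "", err
	}
	return req.Method, req.URL.String(), nil
}

// parseResponse returns the status code and body found in a raw HTTP response.
func parseResponse(raw string) (int, string, error) {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(raw)), nil)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	method, target, err := parseRequest(rawRequest)
	fmt.Println(method, target, err)
	status, body, err := parseResponse(rawResponse)
	fmt.Println(status, body, err)
}
```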
HTML Parsing. To obtain the resources associated with an HTML page, the web cache must parse the HTML page, find all the resources it needs to host, and update the corresponding links in the HTML page returned to the client. The web cache should cache the resources with the page. You must only parse and obtain resources pointed to by attributes (src/href) of the following HTML tags:
Persistence. In the case of restarts or crashes, the web cache must be able to rebuild the cache completely from disk once it is started up again. This requires that the cached content is written to disk. Your cache must guarantee that any content that is cached in memory is also cached on disk. For example, the updated pages must be written to disk before they are sent back to the user as responses.
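As a minimal sketch of resource discovery, the Go example below collects src/href values with a regular expression. The tag set used here (img, script, link) is an assumption made purely for illustration, and findResources is a hypothetical helper name. A regular expression is not a full HTML parser, so treat this as a starting point rather than a robust solution.

```go
package main

import (
	"fmt"
	"regexp"
)

// resourceAttr matches a src or href attribute inside an <img>, <script>, or
// <link> tag. The tag set is an assumption made for this example.
var resourceAttr = regexp.MustCompile(
	`(?is)<(?:img|script|link)\b[^>]*?\b(?:src|href)\s*=\s*["']([^"']+)["']`)

// findResources returns the distinct resource URLs referenced by the page,
// in order of first appearance.
func findResources(html string) []string {
	var urls []string
	seen := map[string]bool{}
	for _, m := range resourceAttr.FindAllStringSubmatch(html, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			urls = append(urls, m[1])
		}
	}
	return urls
}

func main() {
	page := `<html><head><link rel="stylesheet" href="/style.css"></head>
<body><a href="/other.html">other</a><img src="http://example.com/logo.png"></body></html>`
	for _, u := range findResources(page) {
		fmt.Println(u)
	}
}
```

Note that the anchor tag in the sample page is skipped, because only the assumed tag set is considered.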
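One possible way to meet the write-before-respond requirement is sketched below in Go. It assumes one file per cached URL, named by the SHA-256 hash of the URL. This layout and the helper names fileFor, persist, load, and roundTrip are assumptions, not part of the assignment. The sketch writes to a temporary file, syncs it, and renames it into place, so that a crash does not leave a half-written entry (rename is atomic on POSIX file systems).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// fileFor maps a URL to a stable file name inside dir.
func fileFor(dir, url string) string {
	sum := sha256.Sum256([]byte(url))
	return filepath.Join(dir, hex.EncodeToString(sum[:]))
}

// persist stores body for url on disk. Call it before replying to the client.
func persist(dir, url string, body []byte) error {
	tmp, err := os.CreateTemp(dir, "tmp-*")
	if err != nil {
		return err
	}
	// Clean up the temporary file on failure; after a successful rename this
	// call fails harmlessly because the temporary name no longer exists.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(body); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file contents to disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), fileFor(dir, url))
}

// load reads a stored body back, as a restarted cache would.
func load(dir, url string) ([]byte, error) {
	return os.ReadFile(fileFor(dir, url))
}

// roundTrip persists one page in a scratch directory and reads it back.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "webcache-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	const url = "http://example.com/index.html"
	if err := persist(dir, url, []byte("<html>hi</html>")); err != nil {
		return "", err
	}
	body, err := load(dir, url)
	return string(body), err
}

func main() {
	fmt.Println(roundTrip())
}
```

A complete design would also need to record the original URL and fetch time for each entry, since a hashed file name alone is not enough to rebuild the cache index after a restart.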
Cache Eviction and Replacement. If the cache is full and a new item must be added, a cached item must be chosen to be replaced according to the cache replacement policy. Your web cache must have support for the following cache replacement policies, which will be specified as command line arguments:
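To illustrate what a replacement policy looks like in code, here is a minimal Go sketch of least-recently-used (LRU) eviction. LRU is chosen only as a common example and is an assumption, not necessarily one of the policies required here. The sketch also assumes capacity is counted in items rather than bytes. The type and function names are hypothetical, and the structure is not safe for concurrent use without a lock.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached item.
type entry struct {
	url  string
	body []byte
}

// lruCache evicts the least recently used entry once capacity is reached.
type lruCache struct {
	capacity int                      // maximum number of items (an assumption)
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // URL -> element in order
}

// newLRU creates a cache holding at most capacity items (minimum 1).
func newLRU(capacity int) *lruCache {
	if capacity < 1 {
		capacity = 1
	}
	return &lruCache{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the cached body and marks the entry as recently used.
func (c *lruCache) Get(url string) ([]byte, bool) {
	el, ok := c.items[url]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).body, true
}

// Put stores body under url and returns the URL evicted to make room, or ""
// if nothing was evicted.
func (c *lruCache) Put(url string, body []byte) string {
	if el, ok := c.items[url]; ok {
		el.Value.(*entry).body = body
		c.order.MoveToFront(el)
		return ""
	}
	evicted := ""
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		evicted = oldest.Value.(*entry).url
		c.order.Remove(oldest)
		delete(c.items, evicted)
	}
	c.items[url] = c.order.PushFront(&entry{url: url, body: body})
	return evicted
}

func main() {
	c := newLRU(2)
	c.Put("a", []byte("1"))
	c.Put("b", []byte("2"))
	c.Get("a")                             // "a" is now the most recently used
	fmt.Println(c.Put("c", []byte("3"))) // evicts "b"
}
```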
Item expiration. If an item in the cache expires (it was originally fetched from the source more than expiration_time ago), then the web cache should delete the item from the cache. The client should never observe expired content. If the client issues a GET for content that has expired, the web cache should fetch the content anew.
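The expiration rule can be sketched as a check at lookup time. The Go example below is a minimal sketch under two assumptions: expiration_time is treated as a duration, and expired items are deleted lazily when they are next requested. The type expiringCache and the injectable clock are hypothetical names used for illustration, and the mutex is there because the cache serves multiple clients concurrently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// item is a cached body together with the time it was fetched from the source.
type item struct {
	body    []byte
	fetched time.Time
}

// expiringCache never returns an item older than ttl.
type expiringCache struct {
	mu    sync.Mutex
	ttl   time.Duration    // stands in for expiration_time
	now   func() time.Time // injectable clock, handy for testing
	items map[string]item
}

func newExpiringCache(ttl time.Duration, now func() time.Time) *expiringCache {
	return &expiringCache{ttl: ttl, now: now, items: map[string]item{}}
}

// Put records body as freshly fetched.
func (c *expiringCache) Put(url string, body []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[url] = item{body: body, fetched: c.now()}
}

// Get returns the body only if it has not expired. Expired items are deleted,
// and the caller is expected to fetch the content again.
func (c *expiringCache) Get(url string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	it, ok := c.items[url]
	if !ok {
		return nil, false
	}
	if c.now().Sub(it.fetched) > c.ttl {
		delete(c.items, url)
		return nil, false
	}
	return it.body, true
}

// demo reports whether a lookup hits 5s and 11s after a fetch, with a 10s TTL.
func demo() (hitAt5s, hitAt11s bool) {
	clock := time.Unix(0, 0)
	c := newExpiringCache(10*time.Second, func() time.Time { return clock })
	c.Put("http://example.com/", []byte("hello"))
	clock = clock.Add(5 * time.Second)
	_, hitAt5s = c.Get("http://example.com/")
	clock = clock.Add(6 * time.Second)
	_, hitAt11s = c.Get("http://example.com/")
	return hitAt5s, hitAt11s
}

func main() {
	fmt.Println(demo())
}
```

A design that also persists items would need to remove the expired item from disk at the same point.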
Here is what the protocol will look like in terms of flow of messages:
Note that in the above diagram index.html does not have any
other linked resources to be cached. Here is how the flow
would differ if, for example, index.html had an image. In this case
the cache would (1) retrieve the image after receiving
index.html from www (parsing index.html to find the linked
image resource), (2) cache the image in memory and on disk,
(3) update/rewrite the image URL in the index.html to point to
the image in the cache, and (4) only then deliver index.html
back to the Client. After this point, the Client would fetch
the image from the web cache (and not www). Here is an
illustration of these steps:
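The ordering of steps (1) to (4) can be sketched in Go as follows. This is a minimal sketch under stated assumptions: the resource URLs have already been extracted from the page, the fetch and store functions are supplied by the caller, and the "/cached?u=" path used to address items in the cache is invented for this example. The assignment leaves the actual URL scheme up to you.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// cacheURL builds the address under which the cache would serve a stored
// resource. The "/cached?u=" path is made up for this example.
func cacheURL(cacheAddr, original string) string {
	return "http://" + cacheAddr + "/cached?u=" + url.QueryEscape(original)
}

// preparePage performs steps (1) to (3) for each resource and returns the
// rewritten page that step (4) would deliver to the client.
func preparePage(page, cacheAddr string, resources []string,
	fetch func(string) ([]byte, error), store func(string, []byte) error) (string, error) {
	for _, res := range resources {
		body, err := fetch(res) // (1) retrieve the linked resource
		if err != nil {
			return "", err
		}
		if err := store(res, body); err != nil { // (2) cache in memory and on disk
			return "", err
		}
		// (3) point the quoted attribute value at the cached copy
		page = strings.ReplaceAll(page, `"`+res+`"`, `"`+cacheURL(cacheAddr, res)+`"`)
	}
	return page, nil
}

// demo runs the steps on a one-image page using in-memory stand-ins for the
// network and the disk. It returns the rewritten page and the stored count.
func demo() (string, int, error) {
	stored := map[string][]byte{}
	fetch := func(string) ([]byte, error) { return []byte("png-bytes"), nil }
	store := func(u string, b []byte) error { stored[u] = b; return nil }
	page := `<img src="http://www.example.com/cat.png">`
	out, err := preparePage(page, "127.0.0.1:8080",
		[]string{"http://www.example.com/cat.png"}, fetch, store)
	return out, len(stored), err
}

func main() {
	fmt.Println(demo())
}
```

Only after preparePage returns successfully would the rewritten index.html be persisted and sent back to the client.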
There be dragons. This assignment is easy to describe, but under the surface there are many challenges that you will have to navigate with your team. These challenges are deliberately under-specified or completely unspecified in this assignment. This is because we want you to solve them on your own, in your own unique way. Make sure that you start early and that you think through your design in detail before embarking on an implementation.
We suggest the following step-by-step approach for successfully completing the assignment:
The Azure cloud is composed of several data centers, located worldwide. Each data center hosts many thousands of machines that can be used (i.e., rented) for a price. In this assignment you may use the Azure cloud to deploy and test your solution in the wide area. That is, you will deploy your web cache in a VM on Azure. The browser client does not need to be deployed on Azure (just run it on your personal machine).
Although you will test and deploy your system on Azure, it will have nothing Azure-specific about it, e.g., it should be possible to run your system on the CS ugrad servers without any code modifications (though we will not test this).
Using Azure: stop VMs when not using
We prepared a Google Slides presentation covering the basic workflow of getting a VM running on Azure for this and future assignments. To set up the Go environment in a VM you can use the azureinstall.sh script.
The default Azure subscription comes with a limitation of 20 cores per region. For this assignment you should not need cores above this limit.
Use this site to check your account balance.
Access information will be posted to Piazza.
A key detail is that each second that your VM is running it is draining your balance (yikes!). You should STOP your VMs when you are not using them. It's up to you to police your own account.
Write a Go program called webcache.go that behaves according to the description above.
WebCache's recommended command line usage:
go run web-cache.go [ip1:port1] [ip2:port2] [replacement_policy] [cache_size] [expiration_time]
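A minimal Go sketch of parsing these five positional arguments follows. Several things here are assumptions: the roles of the two addresses are not spelled out in this excerpt, so they are kept as addr1 and addr2. The policy is stored as an opaque string (the "LRU" value in main is only a placeholder), and cache_size and expiration_time are parsed as positive integers without assuming any unit. The names config and parseArgs are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// config holds the five positional arguments from the usage line.
type config struct {
	addr1, addr2 string // ip1:port1 and ip2:port2
	policy       string // replacement_policy, not validated here
	cacheSize    int    // cache_size, unit not assumed
	expiration   int    // expiration_time, unit not assumed
}

// parseArgs validates the arguments that follow the program name.
func parseArgs(args []string) (config, error) {
	if len(args) != 5 {
		return config{}, fmt.Errorf("expected 5 arguments, got %d", len(args))
	}
	for _, a := range args[:2] {
		if _, _, err := net.SplitHostPort(a); err != nil {
			return config{}, fmt.Errorf("bad address %q: %v", a, err)
		}
	}
	size, err := strconv.Atoi(args[3])
	if err != nil || size <= 0 {
		return config{}, fmt.Errorf("bad cache_size %q", args[3])
	}
	exp, err := strconv.Atoi(args[4])
	if err != nil || exp <= 0 {
		return config{}, fmt.Errorf("bad expiration_time %q", args[4])
	}
	return config{addr1: args[0], addr2: args[1], policy: args[2],
		cacheSize: size, expiration: exp}, nil
}

func main() {
	// In the real program the slice would be os.Args[1:].
	cfg, err := parseArgs([]string{"127.0.0.1:8080", "127.0.0.1:9090", "LRU", "100", "60"})
	fmt.Println(cfg, err)
}
```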
We will follow a demo-style grading scheme. At a high level, your mark for this assignment is 15% of your final mark and has these components:
Note that the demo actually exercises both the Code and the Demo portion of your mark. So, the best way of thinking about the demo rubric below is that it is 100% of your mark. Of course, we reserve the right to look at your code on our own if we want to further convince ourselves about some functional aspect or check that you do indeed implement some particular piece (e.g., HTML parsing).
The demo has 3 parts:
Each demo will have 2 instructors -- either two TAs or Ivan and one of the TAs. One of us will be taking notes, and the second person will be laser focused on your demo -- what you are doing, how you are doing it, clarifying what's going on, etc.
Each team has a guaranteed slot of 20 minutes. Each part of the demo is expected to take about 6.66 minutes. Note that 20 minutes may be a hard cutoff, especially in cases where there is another team scheduled after your team. (We give ourselves about 5-10 minutes to deliberate on your mark and review the demo after you leave the room. The more time we have for this, the more likely it is that you will get a fair shake.)
Note that your demo will be run in our environment. That is, when you enter the demo room, we will have an Azure VM up and running and ready for your demo. The VM will have:
Your demo in part (A) must obey the following parameters:
Part B (recovery):
Part C (design Q/A):
Nice to have, or strongly recommended for the demo:
What to bring for the demo:
This project is extensible with the following extra credits: