Project 2

416 Distributed Systems: Project 2 [Open ended]

Fall 2018

Project 2 is an open-ended project that must be done in a team of 4 people and must be (at least partially) deployed on Azure. The extra number of people (over project 1) will provide you with sufficient developer power to execute on an ambitious project.

Type of project

Your project must address a non-trivial problem related to distributed systems. It must include a substantial software effort in Go. Note that 'substantial' includes complexity and not just code size. The most direct way to satisfy the project requirement is to prototype a distributed system. Such a system can be built from scratch, but the project can also be formulated as a non-trivial extension to an existing system. The idea behind the system does not need to be original, but the majority of the distributed logic in the implemented system must be implemented by the project team.

As a benchmark, your project must have about the same complexity/difficulty as project 1.

Project constraints (evolving):

Go must be used for the core distributed logic in the system. However, other languages may also be used in the project. For example, you can build a distributed system in Go and have Android clients, implemented in Java, that connect to it and use it.
The system must be able to support node churn: nodes that fail and leave the system, as well as nodes that join the system.
The system cannot be embarrassingly parallel: there must be some distributed state and coordination between nodes in your system.
At least some part of the system must be deployed on Azure.
The system must be well tested.

Project ideas

Here are several project ideas. Treat these as inspiration; I strongly encourage you to come up with your own project idea.

Project idea: Consensus-Based Failure Detection

After A1 you've become more interested in failure detectors. You start reading more and realize that the failure detector you developed in A1 is not robust to network partitions, which may cause inconsistencies in the system about the state of a node. For example, Node A and Node B are monitoring Node C. If Node A is partitioned from Node B and Node C, then Node A will incorrectly report that Node C has failed while according to Node B, Node C is still alive. You find out about the Paxos algorithm and realize that this problem is in fact a consensus problem. You decide to implement a consensus protocol, like Paxos or Raft, on top of your failure detector library to decide if a majority of systems believe that a node has really failed.

Project idea: Regionally Restricted Streaming Service

If you were to take a random survey of current UBC students to determine the answer to the question "How do you procrastinate?", the majority of the responses would have something to do with watching Netflix. But as a current 416 student, you have already been exposed to web-proxies and CDNs. So, now your favorite way of procrastinating is to build Netflix from scratch instead of watching it. To do this, you would need to make a video streaming service built on top of a custom CDN, which provides regional restrictions. To make your system more robust, you may also choose to use a distributed key-value store to house your user data.

Project idea: A version of DynamoDB

"Amazon is expanding in downtown!" is something you have gotten used to hearing after living in Vancouver the last few years. As an upper-year CS major, you have started looking for well-paid full time jobs and identified Amazon as a viable destination. Being a smart group of 416 students, you decide to build Amazon's DynamoDB from scratch as you think that would be a good way to impress an Amazon recruiter. You also like how they use CRDTs to provide eventually consistent reads but you are also interested in finding out how they can also provide strongly consistent reads across their whole distributed database. And, to widen your scope, you decide to use your newly built DynamoDB on Azure instead of AWS to increase your chances of impressing a Microsoft recruiter, as well!

Project idea: Build an automated grading framework

As some of you have already seen and experienced by now, distributed systems are hard to build and even harder to debug. One of the main reasons for this is the presence of different sources of non-determinism present in the system, specifically due to the network. This makes it even harder to grade complex systems as testing for basic correctness properties of a distributed system requires a lot of time/effort. As some of the current and previous 416 TAs will tell you, writing specialized auto-grading frameworks for each distributed system can be time-consuming and error-prone. Your mission, should you choose to accept it, is to create an automated grading framework that works for any general purpose distributed system which provides a clean and tidy interface to control and specify different network conditions as well as a way to specify different test cases.

Project idea: Build an anonymity network

Tor is an anonymity system built on onion routing. Tor allows clients to obfuscate their network identity/location (IP address). The idea is simple, but supporting multiple clients, defending against attacks, and providing good performance to clients (e.g., responsive browsing) are non-trivial requirements.

One version of this project is to prototype a basic version of Tor, deploying it on Azure, and demonstrating that you can use it to browser the internet. A basic version might include:

Handling connecting/disconnecting guard/relay/exit nodes
Secure onion routing (intermediate hops do not observe payload)
Circuit setup/tear-down protocols
Periodic circuit refresh to avoid using a circuit for too long

Tor is just one type of anonymity system. If you are interested in this space, there are a variety of other system designs that you can adopt. Or, feel free to create a new one!

Project idea: Build a peer-to-peer machine learning system

Machine learning is all the rage. There are many distributed frameworks, but all of them assume a centralized learning process with access to a central store of training data. Build a peer-to-peer solution for learning a global model (of a variety of your choice) that has as few centralized components as possible and where data is spread across peers. Assume an adversarial context in which peers do not want to reveal their data to others. For this project you may want to recruit to your team someone who has taken CPSC 340 (and has done well in it). You can also substantially expand the security/privacy requirements of this project.

Project idea: Build a distributed web crawler/search engine

Web crawling is kind of a 90s topic. But, an efficient and scalable version is a complex distributed system with many interesting pieces. An assignment from 416-2016w2 describes an 'assignment' version of a web crawler that is a good starting point. This version described a set of worker crawlers that are spread over multiple data-centers, a web-graph that is maintained in a distributed fashion, a distributed page rank computation, and keyword search capability. You could extend this version or consider building a different variant.

Some other project ideas

Build a distributed object system, like Emerald but without a compiler.
Build a distributed shared memory system, like Treadmarks
Build a distributed assertions mechanism that can be used for runtime checking of distributed systems.
Implement a byzantine fault tolerance algorithm, an example is PBFT

Proposal

A project proposal is a paper that details the problem, your proposed approach/solution to the problem, a realistic timeline for your team's actions to create the solution, and a SWOT analysis for your team/project.

You should aim for a proposal that is about 5 pages long. Shorter and you're probably missing some detail; longer and it becomes too detailed and too long to read. That said, there are no page limits (lower bounds nor upper bounds) on your proposal.

Here are three example proposals from a prior instance of this course:

Here are two high-level ways in which I think about your proposal:

A proposal is a contract. If you build the thing described in the proposal then you get a perfect mark on the project. But, writing good contracts is hard work. For example, a good contract must be precise (it should be clear what you are and are not going to do).
A proposal is your opportunity to convince me that you know what you're getting yourself into. I won't let you do a project if I know that you do not stand a reasonable chance of succeeding at it (this is a distributed system course, not an SE course :-) So, the proposal should convince me that you know what you're doing -- that you've thought about the key issues (you know what they are, approximately how you're going to solve them), you know what resources you will need/where you will get them (technology/libraries/algorithms/data sources/hardware/etc), that you thought about how to manage your time and how to manage the team roles and responsibilities (who does what/when), and that it all adds up to a realistic plan for a successful project.

You may also find the following proposal advice useful (from a grad course that I taught).

SWOT analysis:

Your proposal must include a SWOT analysis, which is a project planning tool/exercise. The focus of the SWOT analysis should not be on your idea, but on the various factors that will influence your ability to execute successfully. This includes things like human resources, time/scheduling constraints, etc.

There are three key things you should focus on when you put this together:

Do this as a team: don't outsource this to one team member
Be honest: if you are worried about something, this is your chance to get it out in writing
Be specific: you want each item in SWOT to be one concrete factor, so articulate it as tightly as possible.

Here are some fairly generic examples (i.e., yours should be more specific):

Internals (strengths/weakness):

s: all team members have worked with each other before, so are familiar with each other's work style
s: entire team has extensive experience in programming in Go
s: project is based on an existing system that is well documented and that two of the team members know inside and out
w: none of the team members know each other
w: team members have a variety of communication styles, some of which will require non-trivial management
w: project will be difficult because none of the team members understood Ivan's lecture on BitTorrent

Externals (threats/opportunities) -- you'll probably have fewer of these than the internal ones:

t: team decided to use Android phones, but this require finding a library that supports Go-Dalvik VM cross-compilation, which may or may not exist
t: three of the four team members might have to leave town to compete in the pan-American synchronized swimming competition; this would make them lose two weeks of project work.
o: one of the team members has a relative that works at Raspberry Pi who agreed to send us 100 Pis to use for the 416 project
o: new version of Go comes out in two weeks and the word on the street is that this version will include native support for distributed objects, which will make our project 10x faster to build

Your proposed project might evolve

The proposal is your best effort at scoping out the challenges that you expect to come up against and some ideas/plan on how you will resolve these. But, of course, system design and software engineering is not that predictable.

It's difficult to describe how much you can deviate from the proposal. So, UDP instead of TCP may not be a significant change for some proposals, but could be a major change for others (e.g., if you are investigating distributed congestion control adaptation in TCP and now change to UDP, the difference is major!).

Please discuss potential major changes with the TA assigned to your group and/or with Ivan.

Prototype implementation

There are no constraints on your distributed system design and implementation outside of the ones listed at the top. More details listed below. If you have any questions, please ask on Piazza.

Report

Your final report is a description of the problem you attempted to solve, what you have built to solve the problem, why you built your system the way you did, and how the system works/doesn't work. You should aim for a final report that is no more than 8 pages long. More details below.

Details about deliverables

Project proposal: a well thought out project plan.

[Draft proposal submission] To submit a project proposal draft, create a private piazza post. For this post use the title: "Project 2 proposal draft: [[title]], uidA uidB uidC uidD" with with [[title]] replaced with your project title/name. The easiest way for us to give you feedback is if your post includes a link to an editable google doc containing your proposal. Make sure to identify the group members in the pizza post and the proposal body.
[Final proposal submission] If you submitted a google doc that we can access (above), then continue working on that doc and we will snapshot your google doc at deadline time. If you did not submit a google doc, then you can update your piazza post with the final proposal doc copy, preferably in pdf format.

Project report: a paper detailing the problem, your approach/solution, design of your prototype, and an evaluation of the prototype.

Report must be 8 pages max. This includes all the things that you want Ivan/TAs to read/see, including ShiViz diagrams.
Use your group's github project repository to submit your report as a pdf document. Place your report into report/report.pdf at the top level of your repository (if you use LaTex, make sure that it is compiled into a pdf).
Use whatever format you want, but please don't torture us with font size 8 and awful margins. I recommend the 2-column ACM article format.
You can copy/paste and reuse text from the proposal.
But, of course, don't plagiarize other's work! Attribute all the images/text you borrow; standard writing practices apply.
Your report must stand on its own -- cannot refer to proposal or to a youtube video where you explain your system in an hour-long lecture.
The report must describe the system whose code you are submitting by the code/report deadline. This means that if you have some bits that are unfinished, but you plan to finish them for the demo, then you must explicitly note in your report that they are a work in progress. Note that the ShiViz extra credit cannot be a work in progress item; we have to see ShiViz diagrams for your system in the report.
The report should include an approximate description of your demo script. The demo has specific requirements (see below), and we want to see an outline of your demo plan that matches these requirements in your report. This doesn't have to be long: a short paragraph per demo stage is fine.

Prototype implementation: the details of your system, primarily its code.

Make sure to create a repository with the name P2-uidA-uidB-uidC-uidD using the uids of the four team-members on your team. Follow previous assignment/project instructions to allow Ivan and the TAs access to the repository (but not other teams in the course).
We will read your code.
We will read the code that is in your repo by the deadline.
Note that this version of the code must correspond to what you describe in your report. For example, if you describe a Tor-based system in your report and do not note any work-in-progress items, and we do not see any onion encryption code in your repo, your mark will be penalized.
It is okay if your code doesn't work completely! Your report should note what currently works and what doesn't work.
No, you are not required to include code comments, compilation instructions, or anything else that would make our code-reading lives easier. (Though we would certainly appreciate any such effort).

Project demo: a 25-minute private demo of your project to the instructor and group TA, including a technical Q/A regarding the project design and implementation.

The github project repositories will not be frozen after you submit your code and report. So, you can continue to use your repository to develop and improve your system for the demo! Yes, that means that you can add new code/change existing code/etc.
No, we don't care how much new code you add between report and the demo -- if only 10% of your proposed system is built by report-time (and 90% is a work-in-progress), then expect penalties on the code/report. The demo is a separate beast marking-wise.
If you are working on the EC, you must generated GoVector logs and use ShiViz live during your demo to receive full EC marks. You can do so during the normal operation step, or another step in the demo (below).
Your demo is 25 minutes long. Here are the components/time/demo-mark break down:

Demonstrate normal operation of your system (no failures/joins) with at least 3 nodes.

10min expected
40% of demo mark

Demonstrate system can survive at least 3 node failures

5min expected
20% of demo mark

Demonstrate system can join and utilize at least 3 new nodes

5min expected
20% of demo mark

Design Q/A

5min required (we will stop you at 20min mark to do Q/A)
20% of demo mark
We will ask questions of the entire group and anyone on your team can answer.

There are several critical notions in the rubric above that will vary from system to system (group to group):

Normal operation: show that your system achieves its stated function (e.g., serves HTML to web-browser clients from a CDN, sends email via ToR, etc)
Survive: show that your system's normal operation is not disrupted by the failures (e.g., game continues to be playable after failures)
Utilize: show that your system actively uses the newly joined nodes (e.g., database integrates and uses new nodes to store keys/values)

To get full marks on the demo you must (1) define the above in the report or in the demo, and (2) demonstrate to us that the above conditions are satisfied by your system during the demo (e.g., when you fail/join nodes). You can do this by some of the following:

Show us terminal output with copious verbal explanations
Show us a web browser GUI that shows us blinking lights that semantically match the above goal
Robots that behave as expected (where you defined for us what is expected)
Some other, typically runtime system I/O, means

For failures, you can decide which nodes to fail and how (though if you fail the Azure LB that has a standby that you did not build.. you won't be getting much/any of that 20% survival mark).
Yes, we want to see you inject failures, preferably on a terminal with a Ctl-C signal. Same for node joins.
Your project must use Azure in some way. You have to explain/show that this is indeed the case.
Your system must use a real network between your nodes -- distributed systems that uses localhost for communication will be severely penalized.
As with project 1 demos, your slot is tight -- we may have scheduled other groups before/after your group. I strongly encourage you to practice your demo multiple times and develop a demo script.
Note that unlike project 1 demos, your project 2 executes in your environment. This means that you can curate it and set it up just the way you want it.
Some projects may have special requirements (e.g., prohibit failure of 3 nodes). If this is the case, post on piazza to arrange a change to your demo components. You must discuss these with us before your demo and we have to sign off on any deviations from the above in writing via a piazza post.

Deadlines

All project 2 deliverables are due at 11:59PM on their respective dates. The project is structured as a series of regularly occurring deadlines Do not miss these!

Nov 12 : Project proposal drafts (not marked, for feedback only).
Nov 17 : Final project proposals
Nov 24 : Each team must send email to an assigned TA to schedule a meeting to discuss project status.
Dec 9 : Project code and final reports
Dec 10-13 : Project demos

Grading scheme

Note that two key project deliverables are write-ups (proposal/report). The proposal write-up alone is 10% of your final mark! The proposal and the final report must clearly convey the high-level ideas, be technically thorough, and must be well-written. Quality technical writing takes time and care. Use well-established methods to improve your writing: draft increasingly detailed outlines, get feedback from your peers/TAs on early ideas and drafts, compose descriptive infographics/diagrams, use the spellchecker, etc. Proposal write-ups that are vague, incomplete, or incoherent will receive a poor mark (you will also probably have to redo your proposal, but with much less time).

Project 2 is 30% of your final mark. Here is the mark breakdown:

Proposal: 10%
Report and code: 10%
Demo: 10%

Extra credit

This project is extensible with an extra credit:

EC1 [2% of final mark]: Add support for GoVector and ShiViz to your system. Generate comprehensible ShiViz diagrams that explain your distributed system data/control flow and protocol design. These diagrams/explanations must be in your final report and you must show a live demo (loading logs into ShiViz and generating and explaining the result). Store the logs for your diagrams in the report/demo in the report repository.

Make sure to follow the course collaboration policy.