Project [open ended]

416 Distributed Systems [open ended]

Multiple deadlines (see below)

Winter 2022

This is an open-ended project that must be completed in a team of 5-6 people. Optionally you can also, (at least partially), deploy it on Azure. The extra number of people (over A3) will provide you with developer power to execute on an ambitious project.

Type of project

Your project must address a non-trivial problem related to distributed systems. It must include a substantial software effort in Go. Note that 'substantial' includes complexity and not just code size. The most direct way to satisfy the project requirement is to prototype a distributed system design. Such a system can be built from scratch, but the project can also be formulated as a non-trivial extension to an existing system. The idea behind the system does not need to be original. The majority of the distributed logic in the implemented system must be implemented by the project team.

As a benchmark, your project must have at least the complexity/difficulty of assignment 3.

Project constraints (evolving):

Go must be used for the core distributed logic in the system. However, other languages may also be used in the project. For example, you can build a distributed system in Go and have Android clients, implemented in Java, that connect to it and use it.
The system must be able to support node churn: nodes that fail and leave the system, as well as nodes that join the system.
The system cannot be embarrassingly parallel: there must be some distributed state and coordination between nodes in your system.
Optional: Some part of the system deployed on Azure.
The system must be well tested.

Project ideas

Here are several project ideas. Treat these as inspiration; I strongly encourage you to come up with your own project idea. If you have an idea and you would like to discuss it, consider posting a description to Piazza.

Project idea: Consensus-Based Failure Detection

After A2 you've become more interested in failure detectors. You start reading more and realize that the failure detector you developed in A1 is not robust to network partitions, which may cause inconsistencies in the system about the state of a node. For example, Node A and Node B are monitoring Node C. If Node A is partitioned from Node B and Node C, then Node A will incorrectly report that Node C has failed while according to Node B, Node C is still alive. You find out about the Paxos algorithm and realize that this problem is in fact a consensus problem. You decide to implement a consensus protocol, like Paxos or Raft, on top of your failure detector library to decide if a majority of systems believe that a node has really failed.

Project idea: Regionally Restricted Streaming Service

If you were to take a random survey of current UBC students to determine the answer to the question "How do you procrastinate?", the majority of the responses would have something to do with watching Netflix. But as a current 416 student, you have already been exposed to web-proxies and CDNs. So, now your favorite way of procrastinating is to build Netflix from scratch instead of watching it. To do this, you would need to make a video streaming service built on top of a custom CDN, which provides regional restrictions. To make your system more robust, you may also choose to use a distributed key-value store to house your user data.

Project idea: A version of DynamoDB

"Amazon is expanding in downtown!" is something you have gotten used to hearing after living in Vancouver the last few years. As an upper-year CS major, you have started looking for well-paid full time jobs and identified Amazon as a viable destination. Being a smart group of 416 students, you decide to build Amazon's DynamoDB from scratch as you think that would be a good way to impress an Amazon recruiter. You also like how they use CRDTs to provide eventually consistent reads but you are also interested in finding out how they can also provide strongly consistent reads across their whole distributed database. And, to widen your scope, you decide to use your newly built DynamoDB on Azure instead of AWS to increase your chances of impressing a Microsoft recruiter, as well!

Project idea: Build an anonymity network

Tor is an anonymity system built on onion routing. Tor allows clients to obfuscate their network identity/location (IP address). The idea is simple, but supporting multiple clients, defending against attacks, and providing good performance to clients (e.g., responsive browsing) are non-trivial requirements.

One version of this project is to prototype a basic version of Tor, and optionally deploying it on Azure, and demonstrating that you can use it to browser the internet. A basic version might include:

Handling connecting/disconnecting guard/relay/exit nodes
Secure onion routing (intermediate hops do not observe payload)
Circuit setup/tear-down protocols
Periodic circuit refresh to avoid using a circuit for too long

Tor is just one type of anonymity system. If you are interested in this space, there are a variety of other system designs that you can adopt (e.g., Vuvuzela). Or, feel free to create a new one!

Project idea: Build a peer-to-peer machine learning system

Machine learning is all the rage. There are many distributed frameworks, but all of them assume a centralized learning process with access to a central store of training data. Build a peer-to-peer solution for learning a global model (of a variety of your choice) that has as few centralized components as possible and where data is spread across peers. Assume an adversarial context in which peers do not want to reveal their data to others. For this project you may want to recruit to your team someone who has taken CPSC 340 (and has done well in it). You can also substantially expand the security/privacy requirements of this project. Take a look at the Biscotti paper for an example of a sophisticated system in this space.

Project idea: Build a distributed web crawler/search engine

Web crawling is kind of a 90s topic. But, an efficient and scalable version is a complex distributed system with many interesting pieces. An assignment from 416-2016w2 describes an 'assignment' version of a web crawler that is a good starting point. This version described a set of worker crawlers that are spread over multiple data-centers, a web-graph that is maintained in a distributed fashion, a distributed page rank computation, and keyword search capability. You could extend this version or consider building a different variant.

Other project ideas

Build a fault-tolerant parallel computing platform based on Spark's RDD abstraction.
Build a peer-to-peer version of DropBox based on the design of XFS.
Build a distributed object system, like Emerald, but without a compiler.
Build a distributed shared memory system, like Treadmarks.
Build a distributed assertions mechanism for Go systems that can check a distributed system's properties at runtime, using Dinv as inspiration.
Implement a byzantine fault tolerance algorithm, a classic example is PBFT.

Project structure

Each project group will be assigned to a TA. This TA will be the point person for project advice, deliverables checking, weekly meetings, and other project-related logistics. You will do all your work in a git repository hosted by UBC enterprise github (you can make it public later, if you wish).

The required project deliverables are listed below. In cases where the deliverable is a written paper, I would prefer that you share the doc with the TAs and Ivan as an editable google doc. If you would rather use another submission approach, let me know. I would prefer final project reports in pdf format for an ACM SIG of your choice.

Project proposal draft: a paper (about 5 pages) describing as much of the project proposal as possible. However, the draft does not need to have well-defined milestones.
Project proposal: a paper (about 5 pages) detailing the problem you plan to address with your distributed system, your proposed approach, and a realistic timeline for your team's efforts. See proposal details below for more information. The proposal must detail three well-defined milestones. Each milestone must include (1) deliverables that you will share with TAs+Ivan for the milestone, (2) a written document that explains the deliverables and their status. The best way to think of the milestones is as a contract: if I accept your proposal and you meet the milestone you describe, then you will receive the full mark for the milestone.
1st project milestone: 1st milestone deliverables and a document that explains the deliverables and their status. For example, if the deliverable is code, then the document must describe what the code does, under what conditions it works, how to run and deploy the code, etc. If the deliverable is a dataset, then the document must explain the format of the data, what it means, how it was collected, how complete it is, etc.
Project update meeting 1: Your project group will meet with a TA assigned to your group (and maybe Ivan) to discuss your project status. You will discuss the first milestone, your progress towards the second milestone, and any outstanding questions/concerns that you might have.
2nd project milestone: 2nd milestone deliverables and a document that explains the deliverables and their status.
Project update meeting 2 (trouble groups only): If you have done poorly on your first milestone and we decide that your team is struggling, we will meet with your team again after the second milestone.
3rd project milestone: 3rd milestone deliverables and a document that explains the deliverables and their status.
Project demo: a live demo of your project work to TAs and Ivan during finals week. During the demo we will ask your team questions about your system implementation and design.
Prototype implementation: git repository with your code.
Project report: a paper detailing the problem you set out to solve, design of your system, implementation description, and some evaluation results. The final report (due at the end of the term) should be no longer than 8 pages (excluding references) and should resemble a research paper. See final report instructions below for more information.

Through the course of your project work, it is strongly recommended to have weekly meetings with your assigned TA for the project during his/her office hours. The whole team need not show up, the team lead is enough to provide your TA your team's status update, discuss concerns, queries or get feedback etc.

Proposal

A project proposal details the problem, your proposed approach/solution, and a realistic timeline for your team. The proposal must include at least the following sections: introduction/motivation, background, proposed approach/solution, evaluation methodology, timeline.

You should aim for a proposal that is about 5 pages long. Shorter and you're probably missing some detail; longer and it becomes too detailed and too long to read. That said, there are no page limits (lower bounds nor upper bounds) on your proposal. Note that a proposal draft is a proposal without milestones. All the items in this section apply to proposal drafts.

Here are two high-level ways in which I think about your proposal:

A proposal is a contract. If you build the thing described in the proposal then you get a perfect mark on the project. But, writing good contracts is hard work. For example, a good contract must be precise (it should be clear what you are and are not going to do).
A proposal is your opportunity to convince me that you know what you're getting yourself into. I won't let you do a project if I know that you do not stand a reasonable chance of succeeding at it (this is a distributed system course, not an SE course :-) So, the proposal should convince me that you know what you're doing -- that you've thought about the key issues (you know what they are, approximately how you're going to solve them), you know what resources you will need/where you will get them (technology/libraries/algorithms/data sources/hardware/etc), that you thought about how to manage your time and how to manage the team roles and responsibilities (who does what/when), and that it all adds up to a realistic plan for a successful project.

Here are three example proposals from an earlier instance of this course (include SWOT analysis, which you do not need to include; do not include milestones):

Here are three example proposals from a graduate course (these include milestones):

Detailed proposal instructions.

The timeline must include dates and milestones/deliverables. It must be sufficiently refined to include milestones that are specific to your project. Do not simply list the deliverables without listing the internal project deadlines. The timeline is there to get you to think about your time and to loosely commit to a schedule.
This is a distributed systems course, so make sure that your proposal is focused on issues/challenges/objectives relevant to this topic. If you can, try to focus on distributed abstractions: which ones will you be using, developing, and how will you evaluate their qualities.
The bulk of your proposal must be dedicated to design: what will your system look like, what properties will it have, what features will it include/omit, how will clients interact with your system, etc. This is the most important section. Writing this section well is difficult; spend the time to do a good job on it. The best way to write this design section is to look at A1, A2, A3 specs and model your design description based on those pages.
It is important to omit content that is irrelevant to your proposal. Before including text, consider whether or not it plays a purpose in explaining your proposed system and its objectives. If not, then it can probably be cut.
Consider giving your project/system a name. This way you can easily refer to it in your proposal.
Your project can re-use external code/algorithms/ideas that you find online (e.g., open-sourced Paxos Go implementations). Leverage prior work and build on it to avoid re-inventing the wheel and to get to interesting ideas quicker (e.g., implementing Paxos is itself a complete project).
Your proposals may end up including highly specialized content (e.g., details of crypto algorithms). Make sure to define non-standard/specialized terms, include examples, and intuition -- anything to help get your ideas across. This work will pay off in the long term: (1) it will get you thinking more deeply about your work, and (2) you can re-use it in the project write-up.
Make sure there is logical flow to your proposal. Define terms before you use them, motivate particular perspectives before launching into details, discuss existing systems or previous academic work necessary to understand your proposal before you rely on it for your descriptions. You do not have to provide an academic treatment of related work, but it does not hurt to read a bit about your topic and include references to inform your content.
You build your system so that you can eventually deploy it and run it. Your proposal must include a section on evaluation methodology in which you explain how you will evaluate that your system works as expected. For example, you might optionally also deploy your anonymity system on Azure VMs across different data centers and measure the end-to-end throughput of your system. Evaluation methodology should match your system goals; e.g., if your system provides access to a resource to many clients, then you should evaluate your system with many clients.
Consider including info-graphics/figures to explain your design. Sometimes it is easier to explain a complex idea with a picture (consider diagrams in the A3 spec). Likewise, don't hesitate to include formalism/math to explain your ideas (though, be careful with including formalism for formalism sake -- make sure it helps to explain rather than confuse the topic).
A well thought-out and detailed proposal will only benefit your group in the long run -- you will have a more clear idea of what you are really working on!

Submitting you project proposal (draft):

[Draft proposal submission] To submit a project proposal draft, create a private piazza post. For this post use the title: "Project proposal draft: [[title]], list of CWLs for your team" with with [[title]] replaced with your project title/name. The easiest way for us to give you feedback is if your post includes a link to an editable google doc containing your proposal. Make sure to identify the group members in the pizza post and the proposal body.
[Final proposal submission] If you submitted a google doc that we can access (above), then continue working on that doc and we will snapshot your google doc at deadline time. If you did not submit a google doc, then you can update your piazza post with the final proposal doc copy, preferably in pdf format.

Your proposed project might evolve

The proposal is your best effort at scoping out the challenges that you expect to come up against and some ideas/plan on how you will resolve these. But, of course, system design and software engineering is not that predictable.

It is difficult to describe how much you can deviate from the proposal. So, UDP instead of TCP may not be a significant change for some proposals, but could be a major change for others (e.g., if you are investigating distributed congestion control adaptation in TCP and now change to UDP, the difference is major!).

Please discuss potential major changes with the TA assigned to your group and/or with Ivan.

Prototype implementation

There are no constraints on your distributed system design and implementation outside of the ones listed at the top.

We will create a survey where you can specify your team members. We will then create a repository with the name Project-[CWLs-list] using the CWLs of the team-members on your team.
We will read your code.
We will read the code that is in your repo by the deadline of the respective deliverable.
Note that this version of the code must correspond to what you describe in your milestone/report. For example, if you describe a Tor-based system in your report and do not note any work-in-progress items, and we do not see any onion encryption code in your repo, your mark will be penalized.
It is okay if your code does not work completely! Your report should note what currently works and what doesn't work.
No, you are not required to include code comments, compilation instructions, or anything else that would make our code-reading lives easier. (Though we would certainly appreciate any such effort).

Report

Your final report is a description of the problem you attempted to solve, what you have built to solve the problem, why you built your system the way you did, and how the system works/doesn't work.

Detailed report instructions.

Report must be 8 pages max. This includes all the things that you want Ivan/TAs to read/see, including all diagrams.
Use your group's github project repository to submit your report as a pdf document. Place your report into report/report.pdf at the top level of your repository (if you use LaTex, make sure that it is compiled into a pdf).
Use whatever format you want, but please don't torture us with font size 8 and awful margins. I recommend the 2-column ACM article format.
You can copy/paste and reuse text from the proposal.
But, of course, don't plagiarize other's work! Attribute all the images/text you borrow; standard writing practices apply.
Your report must stand on its own -- cannot refer to proposal or to a youtube video where you explain your system in an hour-long lecture.
Any evaluation results must have a proper methodology to introduce the results. What was the goal of the evaluation? Why did you measure what you measured? Typically, the more information you provide to describe your experiments, the better. But, it requires careful judgment to report just the important details.
The report must describe the system whose code you are submitting by the code/report deadline. This means that if you have some bits that are unfinished, but you plan to finish them for the demo, then you must explicitly note in your report that they are a work in progress. Note that the ShiViz extra credit cannot be a work in progress item; we have to see ShiViz diagrams for your system in the report.
The report should include an approximate description of your demo script. The demo has specific requirements (see below), and we want to see an outline of your demo plan that matches these requirements in your report. This doesn't have to be long: a short paragraph per demo stage is fine.

Project demo

The project demo is a 45-minute adventure. You will demo your project to Ivan and a group of TAs in private, including a technical Q/A regarding the project design and implementation.

Detailed demo instructions.

The github project repositories will not be frozen after you submit your code and report. So, you can continue to use your repository to develop and improve your system for the demo! Yes, that means that you can add new code/change existing code/etc.
No, we don't care how much new code you add between report and the demo -- if only 10% of your proposed system is built by report-time (and 90% is a work-in-progress), then expect penalties on the code/report. The demo is a separate beast marking-wise.
If you are working on the EC, you must generated GoVector logs and use ShiViz live during your demo to receive full EC marks. You can do so during the normal operation step, or another step in the demo (below).
Your demo is 45 minutes long. Here are the components/time/demo-mark break down:

Demonstrate normal operation of your system (no failures/joins) with at least 3 nodes.

15min expected
40% of demo mark

Demonstrate system can survive at least 3 node failures

10min expected
20% of demo mark

Demonstrate system can join and utilize at least 3 new nodes

10min expected
20% of demo mark

Design Q/A

10min required (we will stop you at the 35min mark to do Q/A)
20% of demo mark
We will ask questions of the entire group and anyone on your team can answer.

There are several critical notions in the rubric above that will vary from system to system (group to group):

Normal operation: show that your system achieves its stated function (e.g., serves HTML to web-browser clients from a CDN, sends email via ToR, etc)
Survive: show that your system's normal operation is not disrupted by the failures (e.g., game continues to be playable after failures)
Utilize: show that your system actively uses the newly joined nodes (e.g., database integrates and uses new nodes to store keys/values)

To get full marks on the demo you must (1) define the above in the report or in the demo, and (2) demonstrate to us that the above conditions are satisfied by your system during the demo (e.g., when you fail/join nodes). You can do this by some of the following:

Show us terminal output with copious verbal explanations
Show us a web browser GUI that shows us blinking lights that semantically match the above goal
Robots that behave as expected (where you defined for us what is expected)
Some other means (typically runtime system I/O)

For failures, you can decide which nodes to fail and how (though if you fail the Azure LB that has a standby that you did not build.. you won't be getting much/any of that 20% survival mark).
Yes, we want to see you inject failures, preferably on a terminal with a Ctl-C signal. Same for node joins.
In case your project makes use of Azure in some way, you have to explain/show that this is indeed the case.
Your system must use a real network between your nodes -- distributed systems that uses localhost for communication will be severely penalized.
Note that your demo slot is tight -- we may have scheduled other groups before/after your group. I strongly encourage you to practice your demo multiple times and develop a robust demo script. It helps to curate the demo env and set it up just the way you want it.
Some projects may have special requirements (e.g., prohibit failure of 3 nodes). If this is the case, post on piazza to arrange a change to your demo components. You must discuss these with us before your demo and we have to sign off on any deviations from the above in writing via a piazza post.

Deadlines

All project 2 deliverables are due at 11:59PM on their respective dates. The project is structured as a series of regularly occurring deadlines Do not miss these!

March 18 : Project proposal drafts
March 25 : Final project proposals
April 1 : Milestone 1
April 4-8 : Project update meetings
April 8 : Milestone 2
April 15 : Milestone 3
April 19 : Project code and final reports
April 18 - 22 : Project demos

Grading scheme

The project is 45% of your final mark. Here is the mark breakdown across the different deliverables:

Proposal draft 2%

Proposal 5%

1st Milestone 7%

2nd Milestone 7%

3rd Milestone 7%

Demo 7%

Report and implementation 10%

Total (of final mark) 45%

Extra credit

This project is extensible with one extra credit option.

EC1 [1% of final mark]: Add sufficient tracing to your system to be able to observe the normal operation execution of your system across all the nodes taking part in the execution using ShiViz. The corresponding ShiViz diagrams must match your system design and should help explain how your system is implemented.

The diagrams and explanations of the diagrams must be in your final report (aim for 1-2 diagrams in total). Please store the logs for your diagrams in the report in your repository. You must also show us a live demo: generate trace output during the normal operation part of the demo, visualize the resulting logs with ShiViz, and explain the resulting diagram to us.

Make sure to follow the course collaboration policy.