Assignment 1

416 Distributed Systems: Assignment 1

Due: September 18 at 11:59pm

Fall 2018

In this assignment you will get started with programming in the Go language. To solve this assignment you will need to install Go, figure out how to compile, run, and debug a Go program, and implement a UDP-based failure detector library described below.

Overview

Distributed systems are frequently designed to deal with failures. There are many kinds of failures (hardware/software/network/etc). The focus of this assignment is on node failures and you will design a failure detector (FD) for nodes in a distributed system. That is, your FD will be able to tell whether or not a node in a distributed system has failed. Note that real distributed systems, including those you will build in this course, operate over networks that are asynchronous and unreliable. This makes true node failure detection impossible: you cannot distinguish the failure of a node from the failure of the network. Therefore, your FD will be a best effort FD.

You will structure your FD as a library (fdlib) that can be re-used across projects. Your library must be able to monitor multiple nodes concurrently, integrate a simple round-trip time estimator, use a simple UDP heartbeat protocol to detect failures, and respond to heartbeats on behalf of the local node. Your library will be marked automatically (we will write clients and script scenarios to exercise your library). It is therefore important to follow the spec below exactly, including the fdlib API and its semantics, the UDP protocol/packet format, and other details.

High-level protocol description

In this assignment a distributed system is composed of some number of peer nodes, each of which uses the fdlib that you will implement. Each node may monitor some subset of nodes and may allow itself to be monitored by other nodes. This means that the monitoring node actively sends heartbeat messages (a type of UDP message defined below) to check if the node being monitored has failed, or not. Failure is determined/defined using the policy described below. Upon receiving a heartbeat, an fdlib that has been set to respond (through the fdlib API below), must respond to heartbeat messages with an ack message.

The following diagram illustrates an example three node system and the failure monitoring relationships between the nodes. In the diagram, Node 1 and Node 2 monitor each other (send each other heartbeats and receive acks); Node 2 also monitors Node 3. However, Node 3 does not monitor Node 1 nor Node 2.

fdlib API

Your fdlib must provide the following API. Conceptually, the API has two parts: calls to manage how fdlib deals with incoming heartbeat messages from other monitoring nodes (StartResponding and StopResponding), and calls to manage how the fdlib monitors other nodes (AddMonitor, RemoveMonitor, and StopMonitoring). These two sets of calls are independent, e.g., the library can be used just for responding to heartbeats, or just for monitoring other nodes, or both.

Also Note that all the calls below assume a single-threaded library client (each call invoked by the client must run to completion and return before another invocation by a client can be made). In the descriptions below if err (of built-in error type) is nil then the call succeeded, otherwise err must include a descriptive message of the error. There are no constraints on what the error message is, the exact text does not matter.

fd, notify-channel, err ← Initialize(epoch-nonce, ChCapacity)

Initializes the library with an epoch nonce with value epoch-nonce. Initialize can only be called once. Multiple invocations should set fd and notify-channel to nil, and return an appropriate err. The fd is an FD interface instance that implements the rest of the API below. The returned notify-channel channel must have capacity ChCapacity and must be used by fdlib to deliver all failure notifications for nodes being monitored.

err ← FD.StartResponding(LocalIP:LocalPort)

Tells the library to start responding to heartbeat messages on a local UDP LocalIP:LocalPort. The responses (ack messages) should be always directed to the source LocalIP:LocalPort of the corresponding heartbeat message. Note that fdlib can only be responding on a single LocalIP:LocalPort. Multiple invocations of StartResponding without intermediate calls to StopResponding should result in an error. The err should also be appropriately set if LocalIP:LocalPort combination results in a socket-level error.

nil ← FD.StopResponding()

Stops the library from responding to heartbeat messages. This call always succeeds.

err ← FD.AddMonitor(LocalIP:LocalPort, RemoteIP:RemotePort, lost-msgs-thresh)

Tells the library to start monitoring (sending heartbeats to) a node with remote UDP RemoteIP:RemotePort using UDP LocalIP:LocalPort. The lost-msgs-thresh specifies the number of consecutive and un-acked heartbeats messages that the library should send before triggering a failure notification. Multiple invocations of AddMonitor with identical RemoteIP:RemotePort and LocalIP:LocalPort values should (1) update the lost-msgs-thresh for this local/remote address:port pair, if the lost-msgs-thresh value is different from the previous invocation; or (2) result in a no-op, if the lost-msgs-thresh value is identical.

nil ← FD.RemoveMonitor(RemoteIP:RemotePort)

Tells the library to stop monitoring (sending heartbeats to) a node with UDP RemoteIP:RemotePort. This call always succeeds (e.g., it should succeed even if RemoteIP:RemotePort was never passed to AddMonitor).

nil ← FD.StopMonitoring()

Tells the library to stop monitoring all nodes (if any). This call always succeeds (e.g., it should succeed even if AddMonitor was never invoked).

Notification semantics:

All failure notifications must be delivered on the notify-channel that is returned by Initialize.
Notifying the application must not block the fdlib from monitoring nodes or responding to heartbeats. That is, the notification must be passed asynchronously on the notify-channel. You can assume that using a ChCapacity capacity for the notify-channel (passed into Initialize) is sufficient to prevent fdlib from blocking on sending on the notify-channel channel.

A failure notification for a node must be represented (on the notification channel) by structure:

      type FailureDetected struct {
          UDPIpPort string     // The RemoteIP:RemotePort of the failed node.
          Timestamp time.Time  // The time when the failure was detected.
      }

If a node X was detected as failed, then (1) a failure notification must be generated, and (2) the library must stop monitoring node X. To re-start the monitoring of node X after it was detected as failed, the client must again invoke AddMonitor.
No failure notification of node X must be generated after the call to RemoveMonitor(X) has returned.

Here is a high-level example (that omits some details) to illustrate the sequence of dfslib API calls that a local client (of the library) would use to (1) start responding to heartbeats, (2) monitor node X, and (3) monitor node Y:

The protocol on the wire

The heartbeat and ack messages in your system must have a specific format:

type HBeatMessage struct {
	EpochNonce uint64 // Identifies this fdlib instance/epoch.
	SeqNum     uint64 // Unique for each heartbeat in an epoch.
}

type AckMessage struct {
	HBEatEpochNonce uint64 // Copy of what was received in the heartbeat.
	HBEatSeqNum     uint64 // Copy of what was received in the heartbeat.
}

EpochNonce is used to distinguish different instances of fdlib. On restart, the fdlib would be initialized with a different EpochNonce (with call to Initialize), and it must ignore all acks that it receives that do not reference this latest EpochNonce.
The SeqNum is an identifier that is an arbitrary number which uniquely identifies the heartbeat in an epoch. This number could start at any value.
HBEatEpochNonce and HBEatSeqNum in the AckMessage simply identify the HBeatMessage that the ack corresponds to.

The diagram below is a "time-space" diagram (time flows down) that illustrates how an fdlib instance at node X would detect the failure of node Y (by not observing acks from the fdlib instance at node Y):

Note how fdlibX resends the heartbeat message (after an RTT timeout) exactly three times (based on the lost-msgs-thresh value of 3 passed to AddMonitor). After three heartbeat messages have all timed-out, fdlibX timesout on node Y and generates a failure notification. Your fdlib must implement this behavior precisely.

In general, your timeout mechanism should behave as follows:

If an ack message is not received in the appropriate RTT timeout interval, then the count of lost msgs should be incremented by 1.
When an ack message is received (even if it arrives after the RTT timeout), the count of lost msgs must be reset to 0.
Acks that arrive after a failure notification has been generated must be ignored.

Round-trip time (RTT) estimation

Your library must wait for a monitored node to reply to a heartbeat with an ack (stop-and-wait protocol). This means that there should, at most, be one heartbeat message in the network from a monitoring node to a monitored node. Only if the node does not reply in RTT time, then should the library send another heartbeat. How long should the library wait for a reply? Depending on where the node is physically located, the wait time will have to vary. Your library must implement a simple RTT estimator to customize the waiting time for each node being monitored. Note that this waiting time may vary for different heartbeat messages.

Your RTT estimator should work as follows:

Use a value of 3 seconds as the initial RTT value for a previously un-monitored node.
Each time an ack is received the library should (1) compute the RTT based on when the heartbeat was sent, and (2) update the RTT for the node to be the average of the last RTT value for this node and the computed RTT value in (1). Note that the same calculation must be performed even if the ack message arrives after the RTT timeout (but only if it arrives before a failure notification is generated).
The library must remember and use the last RTT value computed for a node (UDP RemoteIP:RemotePort pair) for all time. That is, computed RTT values must carry over across monitoring sessions.

If the fdlib receives an ack before the RTT timeout, then when should it send the next heartbeat? Your library should send the next heartbeat no longer than RTT time since the heartbeat that the received ack is acknowledging. That is, if the first heartbeat was sent at time 1 with RTT of 5, and an ack arrived at time 2. Then, the second heartbeat should be sent before time 6.

Assumptions you can make

You can assume that all messages fit into 1024 bytes.
Nodes will not monitor themselves (i.e., AddMonitor will not be called to monitor the UDP LocalIP:LocalPort passed to StartResponding).

Assumptions you cannot make

UDP is reliable.
Round-trip time between nodes is predictable or bounded.

Implementation requirements

The client code must be runnable on CS ugrad machines and be compatible with Go version 1.9.7.
Your code should not assume that UDP is reliable.
You must use the message types given out in the initial code.
Your solution can only use standard library Go packages.
Your solution code must be Gofmt'd using gofmt.

Solution spec

Write a single go source file called fdlib.go that implements the fdlib library described above. Download the fdlib.go starter code. Note that you cannot change the API in this fdlib.go. Our marking scripts will rely on this API to automatically grade your solution.

Place your fdlib.go file at the top level of the UBC GitHub repository that you are using for your submission. But, you can have other files in the repository, e.g., clients and scripts that you've developed for testing your fdlib.

Starter code and testing servers

Download the example client.go code. This code illustrates how a node in a distributed system may use the fdlib library that you are designing. You can (and should) use this client to test your library, though to rigorously test your system you may want to implement several other client variants, for example, clients that fail and test your fdlib failure detection capabilities.

We will release a testing server that will be running on 198.162.33.23:9999. You can monitor this server and the server will monitor you back, so you can test both your monitoring and responding logic. To monitor your node, the server will assume that your fdlib is responding from a local UDP-IP:(Port+42) that was used to send the server the heartbeats.

To check if the testing server detected your client failure you can take a look at the /tmp/416A1testing file on 198.162.33.23 (you can ssh into this machine and just tail -f or cat the file). This file is readable by all users and accumulates failure notifications by the testing server, which appends notifications to the end of this file. Note that the testing server is using a lost-msgs-thresh of 50 in its monitoring code.

Note, however, that the testing server cannot test your failure detection since (we hope) the server will not fail.

Rough grading scheme

Your code must compile on ugrad servers. Your code must not change the API above. Your code must work on ugrad servers.

If any of these are violated, your mark for this assignment is 0. This is true regardless of how many characters had to be changed to make your solution compile, and regardless of how well your solution works with a different API or on a different machine. The high-level A1 mark breakdown looks like this:

30% : StartResponding and StopResponding work
70% : AddMonitor/RemoveMonitor/StopMonitoring work

You may also want to look at the specific scenarios that we will be testing.

Advice

Start by implementing the AddMonitor command and testing it with with the testing server. Then, implement the other monitor API calls and the StartResponding/StopResponding calls. Test all of these with the testing server.
Have the SeqNum start at 0 and increment by 1. This will help with debugging.
Freeze the API, ideally with a few unit tests whose failure will tell you when you've changed it.
Compile and run your code on the ugrad servers.
Since you will be writing code that will use goroutines, you should consider using locks to serialize access to data structures that might need to be modified and accessed by more than one goroutine
Use gob encode/decode for sending structs across the network.
The time package is helpful for measuring elapsed time between heartbeats and ACKs.
You might find the SetReadDeadline method useful.
You might also want to consider writing an interactive client program that invokes different methods on the fdlib library based on the commands entered (e.g., If a user types "start X", then the method StartResponding(X) will be called). It will help you experiment with different scenarios without having to write multiple sample clients.
Note that due to NATs and Firewalls you may be unable to reach your fdlib clients running on ugrad servers from outside of the UBC network. Either test by running all code on ugrad nodes or by using the UBC VPN.

Make sure to follow the course collaboration policy and refer to the submission instructions that detail how to submit your solution.