416 Distributed Systems: Assignment 1
Due: September 18 at 11:59pm
Fall 2018
In this assignment you will get started with programming in the Go language. To solve this assignment you will need to install Go, figure out how to compile, run, and debug a Go program, and implement the UDP-based failure detector library described below.

Overview
Distributed systems are frequently designed to deal with failures. There are many kinds of failures (hardware, software, network, etc.). The focus of this assignment is on node failures: you will design a failure detector (FD) for nodes in a distributed system. That is, your FD will be able to tell whether or not a node in a distributed system has failed.

Note that real distributed systems, including those you will build in this course, operate over networks that are asynchronous and unreliable. This makes true node failure detection impossible: you cannot distinguish the failure of a node from the failure of the network. Therefore, your FD will be a best-effort FD.

You will structure your FD as a library (fdlib) that can be re-used across projects. Your library must be able to monitor multiple nodes concurrently, integrate a simple round-trip time estimator, use a simple UDP heartbeat protocol to detect failures, and respond to heartbeats on behalf of the local node.

Your library will be marked automatically (we will write clients and script scenarios to exercise your library). It is therefore important to follow the spec below exactly, including the fdlib API and its semantics, the UDP protocol/packet format, and other details.

High-level protocol description
In this assignment a distributed system is composed of some number of peer nodes, each of which uses the fdlib that you will implement. Each node may monitor some subset of nodes and may allow itself to be monitored by other nodes. A monitoring node actively sends heartbeat messages (a type of UDP message defined below) to check whether the node being monitored has failed.
Failure is defined by the policy described below. Upon receiving a heartbeat, an fdlib that has been set to respond (through the fdlib API below) must reply with an ack message.
The following diagram illustrates an example three node system and the
failure monitoring relationships between the nodes. In the diagram,
Node 1 and Node 2 monitor each other (send each other heartbeats and
receive acks); Node 2 also monitors Node 3. However, Node 3 does not
monitor Node 1 or Node 2.

fdlib API
Your fdlib must provide the following API. Conceptually, the API has two parts: calls to manage how fdlib deals with incoming heartbeat messages from other monitoring nodes (StartResponding and StopResponding), and calls to manage how the fdlib monitors other nodes (AddMonitor, RemoveMonitor, and StopMonitoring). These two sets of calls are independent, e.g., the library can be used just for responding to heartbeats, or just for monitoring other nodes, or both.

Also note that all the calls below assume a single-threaded library client: each call invoked by the client must run to completion and return before the client can make another invocation.

In the descriptions below, if err (of the built-in error type) is nil then the call succeeded; otherwise err must include a descriptive message of the error. There are no constraints on the error message, and its exact text does not matter.
Notification semantics:
The protocol on the wire
The heartbeat and ack messages in your system must have a specific format:

    type HBeatMessage struct {
        EpochNonce uint64 // Identifies this fdlib instance/epoch.
        SeqNum     uint64 // Unique for each heartbeat in an epoch.
    }

    type AckMessage struct {
        HBEatEpochNonce uint64 // Copy of what was received in the heartbeat.
        HBEatSeqNum     uint64 // Copy of what was received in the heartbeat.
    }
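As a rough illustration of the two message types above, the following sketch builds an ack by copying both fields from a heartbeat and round-trips it through a byte slice. The helper names (ackFor, encodeAck, decodeAck) are hypothetical, and the use of encoding/gob is an assumption made only for this example: the text above fixes the struct layout, not the serialization, so follow the starter code for the actual encoding.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Message types copied from the wire-format section above.
type HBeatMessage struct {
	EpochNonce uint64 // Identifies this fdlib instance/epoch.
	SeqNum     uint64 // Unique for each heartbeat in an epoch.
}

type AckMessage struct {
	HBEatEpochNonce uint64 // Copy of what was received in the heartbeat.
	HBEatSeqNum     uint64 // Copy of what was received in the heartbeat.
}

// ackFor (hypothetical helper) builds the ack for a heartbeat by copying
// both fields unchanged.
func ackFor(hb HBeatMessage) AckMessage {
	return AckMessage{HBEatEpochNonce: hb.EpochNonce, HBEatSeqNum: hb.SeqNum}
}

// encodeAck serializes an ack with encoding/gob.
// ASSUMPTION: gob is used here only for illustration.
func encodeAck(a AckMessage) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(a); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeAck is the inverse of encodeAck.
func decodeAck(b []byte) (AckMessage, error) {
	var a AckMessage
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&a)
	return a, err
}

func main() {
	hb := HBeatMessage{EpochNonce: 12345, SeqNum: 7}
	payload, err := encodeAck(ackFor(hb))
	if err != nil {
		panic(err)
	}
	ack, err := decodeAck(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ack) // {HBEatEpochNonce:12345 HBEatSeqNum:7}
}
```

A monitoring node can then match an incoming ack to its outstanding heartbeat by comparing both fields, which lets it ignore acks from an earlier epoch or for an older sequence number.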
Note how fdlibX resends the heartbeat message (after an RTT timeout) exactly three times (based on the lost-msgs-thresh value of 3 passed to AddMonitor). After three heartbeat messages have all timed out, fdlibX times out on node Y and generates a failure notification. Your fdlib must implement this behavior precisely. In general, your timeout mechanism should behave as follows:
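One hedged way to read the three-timeout example is as a per-node counter of un-acked heartbeats that is compared against lost-msgs-thresh. The sketch below shows only that bookkeeping. The lossTracker type and its methods are hypothetical names, and resetting the counter when an ack arrives is an assumption of this sketch, so check it against the rules of the assignment before relying on it.

```go
package main

import "fmt"

// lossTracker (hypothetical) counts heartbeats to one monitored node that
// timed out without an ack.
type lossTracker struct {
	thresh int // the lost-msgs-thresh value passed to AddMonitor
	lost   int // heartbeats that have timed out so far
}

// onTimeout records one heartbeat whose RTT timeout expired with no ack.
// It returns true once the threshold is reached, i.e. when a failure
// notification should be generated.
func (t *lossTracker) onTimeout() bool {
	t.lost++
	return t.lost >= t.thresh
}

// onAck records a received ack.
// ASSUMPTION: an ack resets the count, so only consecutive losses add up.
func (t *lossTracker) onAck() {
	t.lost = 0
}

func main() {
	tracker := &lossTracker{thresh: 3}
	fmt.Println(tracker.onTimeout()) // false
	fmt.Println(tracker.onTimeout()) // false
	fmt.Println(tracker.onTimeout()) // true
}
```

With a threshold of 3, the first two timeouts only trigger a resend, and the third one reports the failure, matching the fdlibX example above.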
Round-trip time (RTT) estimation
Your library must wait for a monitored node to reply to a heartbeat with an ack (stop-and-wait protocol). This means that there should be at most one heartbeat message in the network from a monitoring node to a monitored node at any time. The library should send another heartbeat only if the node does not reply within RTT time.

How long should the library wait for a reply? The wait time will have to vary depending on where the node is physically located. Your library must implement a simple RTT estimator to customize the waiting time for each node being monitored. Note that this waiting time may vary for different heartbeat messages. Your RTT estimator should work as follows:
Assumptions you can make
Assumptions you cannot make
Implementation requirements
Solution spec
Write a single Go source file called fdlib.go that implements the fdlib library described above. Download the fdlib.go starter code. Note that you cannot change the API in this fdlib.go: our marking scripts rely on this API to automatically grade your solution. Place your fdlib.go file at the top level of the UBC GitHub repository that you are using for your submission. You may have other files in the repository, e.g., clients and scripts that you've developed for testing your fdlib.

Starter code and testing servers
Download the example client.go code. This code illustrates how a node in a distributed system may use the fdlib library that you are designing. You can (and should) use this client to test your library, though to rigorously test your system you may want to implement several other client variants, for example, clients that fail and so exercise your fdlib failure detection capabilities.

We will release a testing server that will be running on 198.162.33.23:9999. You can monitor this server and the server will monitor you back, so you can test both your monitoring and responding logic. To monitor your node, the server assumes that your fdlib is responding at UDP-IP:(Port+42), where UDP-IP:Port is the local address from which you sent heartbeats to the server.

To check whether the testing server detected your client's failure, look at the /tmp/416A1testing file on 198.162.33.23 (you can ssh into this machine and tail -f or cat the file). This file is readable by all users; the testing server appends its failure notifications to the end of it. Note that the testing server uses a lost-msgs-thresh of 50 in its monitoring code. Note also that the testing server cannot test your failure detection since (we hope) the server will not fail.

Rough grading scheme
Your code must compile on ugrad servers. Your
code must not change the API above. Your code must work on ugrad
servers.
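For the testing server described under "Starter code and testing servers", the responder address is derived from the address used to send heartbeats by adding 42 to the port. The sketch below shows that arithmetic. responderAddr is a hypothetical helper name, and the sample address is made up for the example.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// responderAddr (hypothetical) returns the ip:port at which the testing
// server expects acks to be served, given the local ip:port that was used
// to send heartbeats to it.
func responderAddr(heartbeatAddr string) (string, error) {
	host, portStr, err := net.SplitHostPort(heartbeatAddr)
	if err != nil {
		return "", err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", err
	}
	if port < 0 || port+42 > 65535 {
		return "", fmt.Errorf("port %d leaves no room for Port+42", port)
	}
	return net.JoinHostPort(host, strconv.Itoa(port+42)), nil
}

func main() {
	addr, err := responderAddr("127.0.0.1:8000") // example address only
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // 127.0.0.1:8042
}
```

When testing against the server, pick a heartbeat port low enough that Port+42 is still a valid and free UDP port on your machine.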
Advice
Make sure to follow the course collaboration policy and refer to the submission instructions that detail how to submit your solution.