This assignment will teach you about distributed mutual exclusion and
will introduce you to issues related to node churn.
Your objective
Implement a version of
the Ricart-Agrawala
distributed mutual exclusion protocol. Your version must have the
following additional features:
- Continue to operate when nodes fail
- Be able to add new nodes to the system
- Ability to trace and record distributed state
using dinv
annotations and to visualize state and events in an execution with
ShiViz
- Record distributed state in a way that allows dinv to infer
the safety invariant of mutual exclusion (see below) for
executions of your system in which nodes do not fail and new nodes
do not join
Implementation requirements
-
Nodes can halt-fail at any time. For example, a node can fail
while it executes its critical section (CS), or while it is
attempting to join the system. Your system must continue to
operate correctly even as nodes fail (i.e., consider a node that
failed in the CS as having implicitly ended its CS early).
-
Nodes can decide to join the system at any time, and multiple
nodes may decide to join the system simultaneously. For a
sufficiently long time period during which no node failures occur,
a node that is attempting to join the system must eventually join
the system.
-
Your system must maintain the original guarantees of
Ricart-Agrawala: safety (mutual exclusion for the CS), liveness
(system makes progress), and fairness (bounded wait time and
in-order execution of critical sections).
-
You must use Go's net library and UDP, TCP, or RPC for
communication. You cannot use any non-standard Go libraries other
than GoVector and dinv.
-
Nodes in your implementation should only attempt to execute the CS
if they are not currently waiting to execute the CS, and they
should attempt to execute the CS with
probability FlipProb every
FlipInvokeCS seconds.
-
The CS code of each node implementation should be clearly marked
and must perform at least the following: (1) print to stdout the
logical time and the fact that the node is in the CS, and then (2)
sleep for
CSSleepTime seconds before exiting the CS.
Assumptions you can make
- Every non-failed node can communicate with every other node
- At least one node that is part of the system will always be
alive
- A node that has failed cannot re-join the system with the same
IP:port pair
- The inter-node round-trip time is upper bounded by a
constant RTT and can be used to detect node failures
- Joining nodes know exactly one node (not necessarily the same
node) that is alive and is part of the system
- The node IP:port pairs can be used as a logical clock
tie-breaker
Assumptions you cannot make
- Nodes have synchronized clocks
- Nodes join at some specific rate
- Nodes fail at some specific rate or in some deterministic order
- Joining node knows more than one other node in the system
- Nodes have durable storage (you should not need it)
Solution spec
Write a single Go program called node.go that acts as a node
in the above system. Nodes can start in one of two modes: bootstrap
and joining.
Given a -b flag in the first position the node should behave
in bootstrapping mode and assume that it is the only node in the
system. Other, additional, nodes can be added to the system in joining
mode only. A node started in bootstrap mode takes the following
arguments on the rest of its command line:
- IP:port : address that the bootstrap node should use to
communicate with other nodes in the system
- RTT : the RTT constant in milliseconds
- FlipProb : probability (value between 0 and 1) of attempting (and
possibly blocking) to execute a CS
- FlipInvokeCS : milliseconds to wait between checking to execute a CS
- CSSleepTime : milliseconds to sleep in a CS
- shiviz-logfile : filename to which GoVector log messages will be written
- dinv-logfile : filename to which dinv state information will be written
Given a -j flag in the first position the node should behave
in joining mode. A joining node takes the following arguments on the
rest of its command line:
- IP-other:port-other : address of one other node that is already
part of the system that the joining node will use to join the
system
- IP:port : address that the joining node should use to
communicate with other nodes in the system
- RTT : the RTT constant in milliseconds
- FlipProb : probability (value between 0 and 1) of attempting (and
possibly blocking) to execute a CS
- FlipInvokeCS : milliseconds to wait between checking to execute a CS
- CSSleepTime : milliseconds to sleep in a CS
- shiviz-logfile : filename to which GoVector log messages will be written
- dinv-logfile : filename to which dinv state information will be written
What to hand in
You must hand in two artifacts via email:
-
The node.go file that implements the two node roles
detailed above.
-
A README file that (1) documents the completeness of your solution
(under what circumstances it works or doesn't), and (2) command
line arguments to dinv that allow me to use it to infer the safety
invariant for your system.
-
A 1-2 page write up in a PDF file called assign3.pdf that
describes:
-
How your system design is robust to node failures and
how your design supports joining nodes. This discussion must be
framed in terms of the three Ricart-Agrawala guarantees that your
system must maintain (i.e., "handling" failing and joining nodes
without maintaining these guarantees is easy).
-
How you encode your node state with dinv annotations for dinv
to infer the desired safety invariant. And, why the inferred
dinv expression indicates that your system satisfies the
safety invariant.
Grading
Assignment will be marked based on me running your system and reading
your code and design document. I will give out partial marks in case
your system is incomplete.
- System works with no failures/joins: 40%
- System works with failures: 20%
- System works with joins: 25%
- Document describes a valid system design that satisfies the
three guarantees: 10%
- Dinv infers the safety invariant for a run of your system
with no failures and no joins: 5%
Advice
-
As with most initial distributed system prototypes, prioritize
correctness over everything else. For example, do not worry about
performance. You will receive marks according to the correctness
of your system implementation and its design.
- Integrate GoVector early on and use ShiViz to confirm/check
expected behavior of your system.
-
Implement the basic version of Ricart-Agrawala that does not
support failing or joining nodes. In addition to the
original
paper there are numerous internet resources that you can use
to understand the protocol.
-
One of your tasks is to design and implement a new protocol by
which nodes join the system (standard Ricart-Agrawala does not
include this protocol). A good design should precisely define: (1)
when a joining node considers itself to be part of the system
(i.e., when it can request to execute its CS), and
(2) when other nodes in the system consider a joining node to be
part of the system. Note that you, as the designer of the
system, get to decide these.
-
The problem of membership management/reconfiguration in
asynchronous networks has been extensively researched. Avoid
confusing yourself by looking at this work. This assignment
assumes a synchronous network model, making membership
management (and failure detection) much easier to design.
-
For the first version of your joining protocol you can assume that
nodes join through the same node in the system (i.e., concurrent
joins are serialized through the one portal node). Then, solve the
problem of failures (of the joining node, the portal node, and
other nodes in the system). Then, tackle the full joining protocol
where nodes can join through different nodes.
-
There are at least three challenges for designing a robust joining
protocol: (1) several nodes may join simultaneously, (2) the
joining node may fail, and (3) the node through which the joining
node initially communicates to the system could fail. Your system
must continue to operate correctly in the face of all three of
these events (and others!). Note that in case (3) the joining node
may be unable to join the system; in this case it should do the
reasonable thing (e.g., print an error and quit).
-
Pay careful attention to how failures and joining nodes impact the
three guarantees that your system must maintain.
- This assignment is fairly involved (think of it as a
mini-project). Start early, particularly in experimenting with
dinv.
Meta
Make sure to follow the
course collaboration policy and refer
to the assignments instructions
that detail how to submit your solution.
|