Assignment 3

416 Distributed Systems: Assignment 3

Due: Feb 8th at 9PM

2016W2, Winter 2017

In this assignment you will extend your peer-to-peer system from assignment 2 to handle peer joins and peer failures.

High-level description

The system has a similar structure and purpose as in assignment 2: there are some number of peers and there is one server. As before, the peers need to coordinate their server interactions to retrieve some number of resources, and distribute these resources appropriately. In this assignment you will add two more features to your system: (1) a protocol by which new peers can join the existing running peers, and (2) ability to detect and recover from peer failures.

Supporting joining and failing peers leads to two peer-server interaction changes:

  • (Change 1) InitSession no longer takes the number of peers as an argument, since the number of peers changes over time.
  • (Change 2) The server reasons about a logical peerID namespace since actual, physical, peers arrive and depart, or "churn".
The second point above means that the peerIDs returned by the GetResource RPC command are logical peerIDs and do not refer to the physical peerIDs (that each peer is given on the command line). The mapping of logical peerIDs to physical peerIDs is a new management concern in your system.

Supporting joining and failing peers features will also substantially change how your peers coordinate and necessitates a re-definition of the three requirements in assignment 2. However, these two features do not change the three server-communication constraints in assignment 2 and your peers must continue to obey those constraints. As before, peers will print information to the screen and exit when there are no more resources. Below we list the updated peer-to-server protocol, precisely define the two new features, and update the three requirements for your solution. As before, the peer-to-peer protocol, supporting peer joins, failure detection, and failure recovery, is unspecified and is up to you to design and implement.

The details

The peer-to-server protocol is similar to assignment 2. The server listens to connections from peers and expects two kinds of RPC invocations: InitSession and GetResource. The InitSession initiates a session and returns a token called sessionID that must be used in the call to GetResource. The GetResource call returns a string, that we term a resource, and also the logical peerID that is associated with the resource. Your peer code must appropriately maintain a map between logical and physical peerIDs. Resources for the same logical peerID must live on the same physical peer. However, this mapping can (and must be able to) change, for example, as peers fail. Precisely, these RPCs look like this:

  • sessionID ← InitSession()
    • Initiates a new session. Returns a positive integer sessionID to the caller.
  • [resource, logical-peerID, numRemaining] ← GetResource(sessionID)
    • Retrieves a resource for a previously initialized session with ID sessionID. Returns three items:
      • A resource; an arbitrary string.
      • An ID of a logical peer that is associated with the resource. This ID is an integer greater than or equal to 1.
      • Number of remaining resources (i.e., resources available through further invocations of GetResource). This number will be greater or equal to 0.

In interacting with the server your peer group must continue to satisfy the three constraints: (Constraint1) the sessionID RPC must be invoked exactly once, (Constraint2) two consecutive invocations of the GetResource RPC cannot come from the same peer, and (Constraint3) the GetResource RPC must be invoked exactly (1 + numR) number of times, where numR is the numRemaining value received from the first call to GetResource.

Note that in the case that the system contains a single peer and the server returns a numRemaining value that is greater than 0, then the peer should block and wait until more peers join the system.

Peer joins. A peer can be started in 'bootstrap' mode or in 'joining' mode. These are indicated with a command line flag. In bootstrap mode the peer begins a new peer group and is the first peer in the group. In joining mode the peer knows of one other peer that is already in the group and must join the peer group through that peer. A peer has not technically joined the peer group until it invokes the JoinPrint function distributed in the starter code. Both joining and bootstrapping peers must invoke this function once they have joined.

Peer failures. Peers may fail in a fail-stop manner. Exceptions to when peers may fail are listed below under assumptions. A good way to think of this kind of failure is as if the peer stopped executing because the physical machine that hosted the peer has had its power unplugged. Peers that fail appear as having failed to all other peers simultaneously, possibly after a timeout that can be used to detect such failures.

The diagram on the right illustrates how your system might evolve over time. Initially just the server is part of the system, then peer 1, a bootstrapping peer joins; then, two more peers join, and finally peer 1 fails. This is one example scenario. Your code must support every possible combination of peer joins and failures.

Your peer group must satisfy three (updated) requirements:

  • Resource distribution requirement: all pairs of [resource, logical-peerID] returned by calls to GetResource and that have the same logical-peerID must be stored by a single peer process with some physical-peerID. Note that the association between logical and physical peerIDs may change over time, particularly when physical peers fail. If a physical peer has failed, then its resources must be stored (without any resource loss) by a different physical peer that has not failed.
  • Resource printing requirement: once all resources from the server have been retrieved, each resource stored by a peer must be printed to the peer's stdout using the FinalPrint function distributed in the starter code.
  • Termination requirement: after all of the resources have been retrieved from the server and peers have printed all of the resources that they store, all peers must terminate.

You will test and debug your solution against a server that we will give to you. Your peer can be run in one of two modes, in both modes the peer is given a physical peer ID (the peer's identity among peers) and an IP:port to use for peer-to-peer communication. In bootstrap mode the peer is also given the server's RPC TCP IP:port, while in joining mode the peer is instead given the IP:port of a peer that is currently part of the peer group.

Assumptions you can make

  • The server is always reachable, does not fail, and does not misbehave.
  • Each peer has a unique peer ID (specified on the command line).
  • After a session has been initialized, at all times there is at least one joined peer that is alive (non-failed).
  • A timeout of 3s can be used to detect failed peers.
  • Physical peer IDs are distinct (non-repeating).
  • After the server has returned a resource with numRemaining value of 0, no alive peers will fail and no new peers will join the system.
  • A peer that has retrieved a resource through GetResource does not fail for at least 3s after the GetResource RPC call has returned.
  • The node through which a peer joins has itself already joined the system, and does not fail until after the joining peer has joined.
  • A joining peer does not fail until after it has joined the system.
  • There are no ordering constraints on when peers terminate relative to one another, or the resource ordering that the peers use to invoke the FinalPrint function.

Assumptions you cannot make

  • Peers bootstrap/join the system according to the order of their physical peer IDs.
  • Physical/logical peer IDs are non-random and follow some pattern.
  • The peer through which a peer joins is non-random.
  • There is a known peer join pattern or number of peers that will join.
  • There is a known peer failure pattern or number of peers that will fail.
  • Nodes have synchronized clocks (e.g., running on the same physical host).
  • Perfectly reliable network (e.g., if you use UDP for your peer protocols, expect loss and reordering)

Implementation requirements

  • The client code must be runnable on CS ugrad machines and be compatible with Go version 1.6.3.
  • Your code cannot use stdout. You can print anything you want to stderr.
  • Your solution can only use standard library Go packages.
  • Your solution code must be Gofmt'd using gofmt.

Solution spec

Write a single go program called peer.go that acts as a peer in the protocol described above. Your program must implement the following two modes of command line usage:

Bootstrap mode:

go run peer.go -b [physical-peerID] [peer-IP:port] [server ip:port]

  • [physical-peerID] : the physical identity of this peer, an integer greater than or equal to 1.
  • [peer-IP:port] : the IP:port address that other peers joining the system can use to join through this peer.
  • [server ip:port] : the TCP address and port on which the server receives new client RPC connections.

Joining mode:

go run peer.go -j [physical-peerID] [peer-IP:port] [other-peer ip:port]

  • [physical-peerID] : the physical identity of this peer, an integer greater than or equal to 1.
  • [peer-IP:port] : the IP:port address that other peers joining the system can use to join through this peer.
  • [other-peer ip:port] : the address and port of one other peer that is already part of the system.

Starter code

Download the starter code. Please carefully read and follow the RPC data structure comments at top of file.

You can also download and use a bash script to automatically start several peers.

Testing server

Download the server binary. This binary should be run on a CS ugrad server using ./server [local RPC ip:port] [numLogicalPeers], where the first argument is the TCP IP:port on which to listen for incoming RPC connections from peers, and numLogicalPeers is the number of logical peers to use.

Rough grading rubric

  • 2%: Solution satisfies constraint1.
  • 10%: Solution satisfies constraint2 with peer joins but without failures.
  • 20%: Solution satisfies constraint2 with peer joins and failures.
  • 10%: Solution satisfies constraint3.
  • 10%: Solution satisfies the resource distribution requirement with peer joins but without failures.
  • 40%: Solution satisfies the resource distribution requirement with peer joins and failures.
  • 4%: Solution satisfies the resource printing requirement.
  • 4%: Solution satisfies the termination requirement.

Make sure to follow the course collaboration policy and refer to the assignments instructions that detail how to submit your solution.