Assignment 5

416 Distributed Systems: Assignment 5 [Basic Key Value Store]

Due: April 2, at 11:59pm Vancouver time

Winter 2021

So far, in the assignments, you have been working with distributed and mostly immutable state in the context of distributed computation (proof of work). In this assignment you will build a simple distributed key-value store (KVS) that will support mutable operations. Your KVS will resemble A4 in that there will be a single front-end node that receives and processes all client requests. However, unlike previous assignments, your KVS will need to deal with some failures. We will also build on this basic KVS in the final assignment.

Changes since @566, initial assignment release

Thu 1 Apr 2021 17:19:05 PDT
Most up to date starter.zip

  • @667, @610_f1 : KvslibGetResult and FrontEndGetResult value can be nil if err is false (i.e., if no previous put on the key)
  • @602 : Added per-key constraint to formal def of write monotonicity
  • @588 : Made spec in line with starter code, Start at frontend and storage takes a tracing.Tracer
  • @589 : Changed global uniqueness restriction on OpIds to 'For each client, OpIds are unique'
  • @591, @581_f1 : starter.zip - FrontEndAddr typo in storage config json file, and comma after OutputFile in tracing server json
  • Added a diagram to visually illustrate all the distributed tracing actions
  • @569, @571, @572, @573, @576, @580: various fixes to starter files
  • @571: Changes to KVS Initialize and Close APIs; kvs.Initialize should create traces, but not kvs.Close
  • Updates to all distributed tracing specs; these now capture monotonicity and failure requirements.

Overview

System architecture diagram

In this assignment you will develop a fairly centralized, but networked, key value store (KVS) that supports two operations from multiple clients: Put, Get.

The KVS is divided into a frontend node that serves requests from clients and a storage node that holds the key-value data. In this system, clients are unaware of the storage node and requests are directed towards a frontend node that issues the operation to the storage node hosting the data. The frontend node and the storage node must serve requests concurrently.

Your system will need to survive storage node (process) halting failures. However, the storage node will have access to a local disk that it will use to persist all key-value data. If the storage node fails and later restarts, it must re-read (recover) the key-value data into memory from persistent disk storage. The storage node must serve read-only requests (gets) from memory, but data mutating operations (puts) must consistently update the disk state.

Clients must communicate with the frontend node and the frontend node must communicate with the storage node using a reliable messaging protocol. As in A4, you must use Go's built-in RPC library. You will also rely on the same distributed tracing library as in A4. We will use the captured distributed traces to evaluate the correctness of your implementation.

kvslib API

KVS lib diagram

Similar to A3/A4, the client will access the KVS through a library called kvslib. Unlike these previous assignments, one or more clients may invoke multiple concurrent Get/Put operations against identical keys.

The kvslib Initialize, Get, and Put API calls below must each create a new distributed trace instance using Tracer.CreateTrace(). Note that the Initialize call should create a node-local "kvslib trace" that only records actions local to the kvslib (the frontend and storage nodes will have their own node-local traces). In the descriptions below, if err (of built-in error type) is nil then the call succeeded; otherwise err must include a descriptive message of the error. There are no constraints on what the error message is; the exact text does not matter. The kvslib has a KVS struct that must provide the following API (stub methods are given in kvslib/kvslib.go).

  • notify-channel, err ← Initialize(localTracer *tracing.Tracer, clientId string, FrontEndIP:FrontEndPort, ChCapacity)
    • Initializes the KVS instance to use for connecting to the frontend at the given frontend IP:port. The returned notify-channel channel must have capacity ChCapacity and must be used by kvslib to deliver all operation result notifications. ChCapacity determines the concurrency factor at the client: the client will never have more than ChCapacity operations outstanding (pending concurrently) at any one time. If there is an issue with connecting, this should return an appropriate err value, otherwise err should be set to nil.
  • opId, err ← Get(tracer *tracing.Tracer, clientId string, key string)
    • This is a non-blocking request from the client to make a get call for a given key. In case there is an underlying issue (for example, the frontend cannot be reached), this should return an appropriate err value, otherwise err should be set to nil. Note that this call is non-blocking. The returned value must be delivered asynchronously to the client via the notify-channel channel returned in the Initialize call. The value opId is used to identify this request and associate the returned value with this request.
  • opId, err ← Put(tracer *tracing.Tracer, clientId string, key string, value string)
    • This is a non-blocking request from the client to update the value associated with a key. In case there is an underlying issue (for example, the frontend cannot be reached), this should return an appropriate err value, otherwise err should be set to nil. Note that this call is non-blocking. The value opId is used to identify this request and associate the returned value with this request. The returned value must be delivered asynchronously via the notify-channel channel returned in the Initialize call.
  • err ← Close()
    • Stops the KVS instance from communicating with the frontend and from delivering any results via the notify-channel. If there is an issue with stopping, this should return an appropriate err value, otherwise err should be set to nil.
The OpId values must have the following properties:
  • Values of opId must monotonically increase at each kvslib and be locally unique. They should be comparable at a single client to determine client operation request order. Globally, across kvslib instances, values of opId do not need to be comparable.
  • The opId type is uint32. You can assume that a client will not issue more than 2**32 operations during the execution of your system (you do not need to implement id rollover).
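One simple way to satisfy these properties is a per-instance counter. Below is a minimal sketch; opIdGen and its method are illustrative helpers, not part of the required API.

package kvslib

import "sync/atomic"

// opIdGen is an illustrative helper: each kvslib instance keeps its own
// counter, so opIds increase monotonically and are unique within that
// client, but are not comparable across clients.
type opIdGen struct {
    last uint32
}

func (g *opIdGen) next() uint32 {
    // An atomic increment keeps concurrent Get and Put calls safe.
    return atomic.AddUint32(&g.last, 1)
}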
Items returned via the notify-channel should have the following type:
  
type ResultStruct struct {
    opId uint32
    storageFail bool
    result *string
}
Values of ResultStruct must conform to the following:
  • opId must correspond to some previously issued operation that synchronously returned this opId.
  • If the operation could not complete successfully because of storage node failure, then storageFail should be set to true, otherwise it should be set to false.
  • If opId corresponds to a Get(k) operation, then result should contain the value associated with key k. A get on a key that has not been previously put should set result to nil.
  • If opId corresponds to a Put(k,v) operation, then result should be set to v.
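For reference, the API above maps onto Go method signatures along the following lines. This is only a sketch: the authoritative stubs (and the exact parameter types, e.g., for ChCapacity, which is assumed to be an int here) are in kvslib/kvslib.go and take precedence.

package kvslib

import "github.com/DistributedClocks/tracing"

// KVS is the client-facing handle described above; add whatever private
// fields you need (RPC client, opId counter, bookkeeping for Close, ...).
type KVS struct {
    notifyCh chan ResultStruct // returned by Initialize, capacity ChCapacity
}

func (d *KVS) Initialize(localTracer *tracing.Tracer, clientId string,
    frontEndAddr string, chCapacity int) (chan ResultStruct, error) {
    d.notifyCh = make(chan ResultStruct, chCapacity)
    // ... record KvslibBegin, connect to the frontend at frontEndAddr ...
    return d.notifyCh, nil
}

// Get is non-blocking: it returns an opId immediately and delivers the
// result later as a ResultStruct on the notify channel.
func (d *KVS) Get(tracer *tracing.Tracer, clientId string, key string) (uint32, error) {
    return 0, nil // sketch only
}

// Put is non-blocking, mirroring Get.
func (d *KVS) Put(tracer *tracing.Tracer, clientId string, key string,
    value string) (uint32, error) {
    return 0, nil // sketch only
}

// Close stops communication with the frontend and stops deliveries on the
// notify channel; it records KvslibComplete in the kvslib trace.
func (d *KVS) Close() error {
    return nil // sketch only
}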

KVS system consistency semantics

Your KVS system must provide monotonic read and monotonic write consistency. These two consistency types are usually used to describe what a single process in the system observes, without considering which key is being read/written. In this assignment, we will use per key monotonic read/write notions. Your KVS does not need to provide consistency guarantees across different keys.

Note that the consistency semantics defined more precisely below specify what clients of your system should observe about the KVS store. You can achieve these semantics in several different ways in your implementations of kvslib, frontend, and storage nodes. In particular, you can build your system to provide stronger semantics. However, doing so may cause you to lose out on the KVS performance bonus (see rubric below :-)

  • Monotonic reads: Assume that a kvslib delivers a ResultStruct with result string v in response to a client's get(k) operation with opId id1. Then, a get(k) operation with opId value id2, such that id2 > id1, issued by the same client should observe the same or a more recent value for key k.
  • Monotonic writes: Assume that a client performs two put operations: a put(k,v) with opId id1 and a put(k,v') with opId id2 and id1 < id2. Then, put(k,v) must be reflected in the KVS first, before put(k,v'). That is, (1) it should be possible to observe the value v for key k, and (2) key k should have value v' as the final state, assuming no other puts on key k in the system.
  • Frontend ordering: The frontend should process requests from different clients in the order that they are received at the frontend.

Below are a couple of example traces. Each line in the trace is a fully completed operation: the operation was first issued and then returned a ResultStruct. Time flows from top to bottom. There is no concurrency in these examples.

Here is an example of an execution that violates monotonic reads:

  • client1: put(k,v1)
  • client1: put(k,v2)
  • client1: v2←get(k)
  • client1: v1←get(k)
    • A violation: the client already observed v2, so the same client later observing v1 violates read monotonicity.

Here is an example of an execution that violates monotonic writes:

  • client1: put(k,v1)
  • client1: put(k,v2)
  • client1: v1←get(k)
    • A violation since v2 should be the final state of key k.

Here is an example of an execution with two clients that satisfies monotonic reads and monotonic writes:

  • client1: put(k,v1)
  • client2: put(k,v2)
  • client1: v1←get(k)
  • client2: v2←get(k)
  • client1: v2←get(k)

KVS distributed operation

Basic distributed dataflow

The diagram above captures the basic operation and interactions between the kvslib, frontend, storage, and persistent disk for two operations (edge colors denote distributed traces): put, followed by a get. The flow is similar to A4: there is a forwarding-style interaction, extending from the kvslib at the client to the frontend to the storage node to the disk, and back. A couple of points of note:

  • In this assignment we do not provide any RPC description for you to follow. So, it's best to think of the diagram above as a conceptual data-flow diagram.
  • As in A4, you will use distributed traces extensively. The distributed tracing library API is unchanged. This spec assumes tracing. That is, we require that all of your RPCs should carry and return tracing tokens, just as was described in A4. Note, however, that tracing does not extend to the disk.
  • The storage node records the put(k,v) state update to disk before replying to the frontend. This is important to provide strong failure semantics. More on failures below.
  • The frontend does not, and should not, implement any key-value state caching. It should forward requests to the storage node.
  • Likewise, the kvslib does not, and should not, implement any key-value state caching.
  • When servicing a get request, the storage node does not read the value for key k from disk. Your storage implementation must maintain a consistent in-memory copy of the KVS, which it uses to service all get requests.
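To make the last three points concrete, here is a minimal sketch of a storage-side data path that obeys them. All names are illustrative and the durable-write step is left abstract (the persist callback stands in for however you write to diskPath).

package storage

import "sync"

// inMemoryKVS is an illustrative in-memory copy of the key-value state;
// the disk remains the durable source of truth.
type inMemoryKVS struct {
    mu   sync.RWMutex
    data map[string]string
}

// get serves reads purely from memory, never touching the disk.
func (s *inMemoryKVS) get(key string) (string, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    v, ok := s.data[key]
    return v, ok
}

// put persists the update first and only then applies it in memory, so a
// reply is sent to the frontend only once the value would survive a crash.
func (s *inMemoryKVS) put(key, value string, persist func(k, v string) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if err := persist(key, value); err != nil { // durable write under diskPath
        return err
    }
    s.data[key] = value
    return nil
}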

Storage node failure and recovery

This assignment features storage failures. There are no kvslib and frontend failures. Only the storage node will fail (you may assume that fail means killing a process here). The storage node can fail at any time. A storage node may later restart. On restart the storage node should always be able to recover any stored state from disk and continue servicing operations. You can use RPCs to detect storage failures from the frontend. If an RPC from the frontend node to the storage node fails, then the frontend must retry exactly once before considering the storage node to have failed. The frontend should retry after waiting for storageTimeout seconds, where storageTimeout is a user-defined parameter. If the retry request fails, the frontend must let the corresponding kvslib instance know about the failure. The kvslib should not perform any retries.
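A minimal sketch of the retry-once timing rule, assuming the frontend holds a working *rpc.Client for the storage node; the service method name and argument types are placeholders for whatever RPCs you design.

package frontend

import (
    "net/rpc"
    "time"
)

// callStorage issues one RPC to the storage node; on failure it waits
// storageTimeout seconds and retries exactly once. If the retry also fails,
// the caller should consider the storage node failed and report the failure
// to the corresponding kvslib (which never retries).
func callStorage(client *rpc.Client, storageTimeout uint8,
    serviceMethod string, args, reply interface{}) error {
    if err := client.Call(serviceMethod, args, reply); err == nil {
        return nil
    }
    time.Sleep(time.Duration(storageTimeout) * time.Second)
    return client.Call(serviceMethod, args, reply) // the single permitted retry
}

Note that a failed *rpc.Client usually cannot be reused as-is: in practice the retry will need a re-established connection (or, since the storage node connects to the frontend in this assignment, a reconnection from storage). Treat the sketch as showing only the timing rule.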

Your KVS state must be durable. This means that the storage node must keep track of put requests and save them (or updated state) to disk before the client receives a response. When a storage node is initialized it should check if there is state on disk (if not, it should initialize empty state) in the directory specified by the configuration and reload the state of the KVS prior to the failure. The following diagram illustrates the interaction between storage and disk (without retries):

Storage-disk interaction

A couple of points to note in the above diagram:

  • There is a notion of initializing the disk state. The implementation of this initialization will depend on how you decide to use the disk.
  • The diagram illustrates one way to implement durability: reply to a put request from the storage node only once the write has persisted. There are other, more complex schemes that may give you a performance boost.
  • After restart, in-memory state must be completely recovered before any new requests are serviced. This, again, is one way to ensure correctness, but not the only way.
  • The failure may last an arbitrary amount of time. A restart of the storage node is not guaranteed.
  • Note that the storage node may fail during disk initialization and while it is writing to the disk. The diagram does not illustrate these scenarios, but your implementation must handle them. The storage node should always be able to start successfully and it should not lose data.
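As one concrete (but not required) on-disk layout, the whole key-value map could live in a single file under diskPath. A minimal recover-or-initialize sketch, compatible with Go 1.15 and using an illustrative file name:

package storage

import (
    "encoding/json"
    "io/ioutil"
    "os"
    "path/filepath"
)

// loadOrInitState reloads the key-value map from disk, or creates empty
// state the first time the storage node runs. A per-key file or an
// append-only log are equally valid layouts.
func loadOrInitState(diskPath string) (map[string]string, error) {
    stateFile := filepath.Join(diskPath, "kvs-state.json") // illustrative name
    raw, err := ioutil.ReadFile(stateFile)
    if os.IsNotExist(err) {
        return make(map[string]string), nil // first run: start with empty state
    }
    if err != nil {
        return nil, err
    }
    state := make(map[string]string)
    if err := json.Unmarshal(raw, &state); err != nil {
        return nil, err
    }
    return state, nil
}

Because the storage node may crash in the middle of a disk write, rewriting a single file in place can leave it torn; writing to a temporary file and renaming it over the old one, or using an append-only log, are common ways to keep startup always able to recover.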

Below is an end-to-end view of storage failure. This diagram includes retry actions, but omits some of the above details:

Fail not a fail diagram

  • The above diagram illustrates a corner case in which the kvslib observes a failure of a put operation. However, later, the kvslib observes that the put failure was actually a success. Yes, this is correct behavior. If you would like extra credit (see below), you can build your system to eliminate this type of inconsistent behavior (warning: this is hard).

Non-kvslib API specifications

As in A3 and A4, you will need to complete a client-side library (kvslib) that calls the frontend node and performs the operations specified by the client, as well as the frontend and storage libraries. The kvslib API was described earlier. Unlike previous assignments, the frontend and storage nodes also have APIs that you must follow. These are simple APIs that initialize and start the frontend/storage daemons. We will use these to start your frontend/storage instances.

  • Frontend
    • err ← Start( clientAPIListenAddr string, storageAPIListenAddr string, storageTimeout uint8, ftrace *tracing.Tracer )
      • When successful, this call should not return. The frontend server should run until the program is terminated. This call should return an error if there are networking issues (e.g., incorrect ip:port given) or other unrecoverable issues.
      • clientAPIListenAddr is the IP:port where the frontend will expect clients to connect to it.
      • storageAPIListenAddr is the IP:port where the frontend will expect the storage node to connect to it. Notice that, unlike previous assignments, the frontend node does not know about and cannot connect to the storage node; the storage node must connect to the frontend instead.
      • storageTimeout is the number of seconds to wait before retrying a failed storage RPC.
      • ftrace is a frontend tracer instance; use this to (1) create a new distributed trace instance using Tracer.CreateTrace(), and (2) record local frontend actions in the created trace that are not associated with any particular client request (see below).
  • Storage
    • err ← Start( frontEndAddr string, storageAddr string, diskPath string, strace *tracing.Tracer )
      • When successful, this call should not return. The storage server should run until the program is terminated. This call should return an error if there are networking issues (e.g., incorrect ip:port given), storage issues (e.g., permission errors), or other unrecoverable issues.
      • frontEndAddr is the IP:port of the frontend node that this storage node should connect to.
      • storageAddr is the local IP:port that this storage node should use to connect to the frontend node.
      • diskPath is a file system path where the storage node can create/read/write files. You can assume that this directory is empty when the storage node is run for the first time.
      • strace is a storage tracer instance; use this to (1) create a new distributed trace instance using Tracer.CreateTrace(), and (2) record local storage actions in the created trace that are not associated with any particular client request (see below).
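In Go terms, these correspond to signatures along the following lines; the declarations in the starter code take precedence over this sketch.

package frontend

import "github.com/DistributedClocks/tracing"

// Start runs the frontend daemon; on success it never returns.
func Start(clientAPIListenAddr string, storageAPIListenAddr string,
    storageTimeout uint8, ftrace *tracing.Tracer) error {
    // ... listen on both addresses, serve client and storage RPCs ...
    select {} // block forever in the success case (sketch)
}

and, in the storage package:

package storage

import "github.com/DistributedClocks/tracing"

// Start runs the storage daemon; on success it never returns.
func Start(frontEndAddr string, storageAddr string, diskPath string,
    strace *tracing.Tracer) error {
    // ... recover state from diskPath, connect to the frontend, serve RPCs ...
    select {} // block forever in the success case (sketch)
}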

RPC specifications sketch

None. You get to design your own RPCs.
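Since the design is yours, the structs below are only one possible shape, shown to echo the earlier requirement that every RPC carries a tracing token in the request and returns one in the reply. All names are hypothetical, and the token type is assumed to be the same tracing.TracingToken used in A4.

package frontend

import "github.com/DistributedClocks/tracing"

// Hypothetical argument/reply types for a kvslib -> frontend Put RPC.
type PutArgs struct {
    ClientId string
    OpId     uint32
    Key      string
    Value    string
    Token    tracing.TracingToken // generated from the caller's trace
}

type PutReply struct {
    OpId        uint32
    StorageFail bool
    RetToken    tracing.TracingToken // lets the caller resume the trace
}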

Distributed tracing semantics

Tracing actions in A5

As in A4, all actions that you record in this assignment will be part of distributed tracing. This means you need to use the updated trace-based API for recording actions: Trace.RecordAction(action). The diagram above illustrates all of the actions described below. However, this diagram only illustrates the simple case with no concurrent operations. Additionally, the diagram does not show values for opIds, which are necessary to verify the monotonicity properties stated informally above (and more formally below). As with A4, keep in mind that all ordering constraints are evaluated according to the happens-before relation (i.e., using vector clocks), not physical clock timestamps. Below we use the term 'trace' to refer to a trace that was created by a client during a put or a get operation. We will refer to frontend trace (or ftrace) or storage trace (strace) when discussing actions recorded as part of the frontend/storage traces, which are distinct from client traces (though they are part of the same logical timespace).
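As a reminder of the mechanics (unchanged from A4), a per-operation trace created in kvslib is continued at the frontend and storage by shipping tokens inside your RPC arguments and replies. Below is a sketch of the kvslib send side of a Put. It reuses the hypothetical PutArgs from the previous section and assumes the GenerateToken/ReceiveToken calls behave as in A4; the action field names follow this spec, but the struct definitions in the starter code take precedence.

package kvslib

import "github.com/DistributedClocks/tracing"

// buildPutArgs creates this operation's distributed trace, records KvslibPut,
// and attaches a token so the frontend (and storage) can continue the trace.
func buildPutArgs(tracer *tracing.Tracer, clientId string, opId uint32,
    key, value string) PutArgs {
    trace := tracer.CreateTrace()
    trace.RecordAction(KvslibPut{ClientId: clientId, OpId: opId, Key: key, Value: value})
    return PutArgs{
        ClientId: clientId,
        OpId:     opId,
        Key:      key,
        Value:    value,
        Token:    trace.GenerateToken(), // carried in the RPC request
    }
}

On the receiving side, the node passes the token to its Tracer's ReceiveToken to resume the same trace before recording FrontEndPut, StoragePut, and so on; the reply then carries a token back so that kvslib can record the corresponding result action.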

Actions to be recorded by kvslib

  • KvslibBegin{clientId}: signifies the start/initialization of a client's kvslib. For each unique ClientId corresponding to a running client, the tracing log as a whole should contain this action exactly once, as part of the kvslib trace. This action should happen-before all other actions recorded by the client with ClientId.
  • KvslibPut{clientId,opId,key,value}: For a given ClientId, Key and Value, kvslib should record this action just before sending a Put operation to the frontend.
    • A trace must contain this action, or a KvslibGet action, exactly once before either of the corresponding KvslibPutResult and KvslibComplete.
    • This action must happen-before a corresponding FrontEndPut and FrontEndPutResult, which must refer to the same key-value pair.
    • A trace must contain this action before StoragePut and StorageSaveData.
  • KvslibGet{clientId,opId,key}: For a given ClientId and Key, kvslib should record this action just before sending a Get operation to the frontend.
    • A trace must contain this action, or a KvslibPut action, exactly once before either of the corresponding KvslibGetResult and KvslibComplete.
    • This action must happen-before a corresponding FrontEndGet and FrontEndGetResult, which must refer to the same key.
    • A trace must contain this action before StorageGet and StorageGetResult.
  • KvslibPutResult{opId,err}: kvslib should record this action just after receiving the Put results from frontend. Here err is a boolean, per storageFail in the ResultStruct.
    • If KvslibPut is present, a trace must contain this action exactly once after KvslibBegin and KvslibPut.
    • A trace must contain this action after corresponding FrontEndPut and FrontEndPutResult.
    • A trace must contain this action after StorageSaveData.
    • Put results must satisfy the monotonic write condition.
      Assume t1 and t2 are two put-related traces writing to same key and generated by the same client. Let t1 contain KvslibPutResult(opId1,err:false) and let t2 contain KvslibPutResult(opId2,err:false) such that opId1 < opId2. Then t1 must contain a StorageSaveData and t2 must contain a StorageSaveData' such that StorageSaveData happens before StorageSaveData'.
  • KvslibGetResult{opId,key,value,err}: kvslib should record this action just after receiving the Get results from frontend. Here value is the value of the key key. If Err is true (operation failed), the value must be nil. value will have the actual value of key key only when Err is false (operation succeeded).
    • If KvslibGet is present, a trace must contain this action exactly once after KvslibBegin and KvslibGet.
    • If Err is false and Value is not nil, then the most recent instance of StorageLoadSuccess{state} (if it exists) or the most recent StorageSaveData{key,_} (if it exists) must associate key with value. One of these StorageLoadSuccess and StorageSaveData actions must exist. If both exist, the latest one (that happens-after the other) is considered, and shall be called the latest relevant store action.
    • If Err is false and Value is nil, then there must neither be an instance of StorageLoadSuccess{state} nor a StorageSaveData that associates key with any value in the past.
    • A trace must contain this action after corresponding FrontEndGet and FrontEndGetResult.
    • A trace must contain this action after StorageGet and StorageGetResult.
    • Get results must satisfy the monotonic read condition.
      Assume t1 and t2 are two get-related traces generated by the same client. Let t1 contain KvslibGetResult(opId1,key,value,err:false) and let t2 contain KvslibGetResult(opId2,key,value',err:false) such that opId1 < opId2. As specified in the second point, both KvslibGetResults must be associated with a latest relevant store action, that associates key with value and key with value' respectively. Given this context, either value' must be equal to value, or the latest relevant store action associated with KvslibGetResult(opId1,key,value,err:false) must happen-before the latest relevant store action associated with KvslibGetResult(opId2,key,value',err:false).
  • KvslibComplete{clientId}: For a specific clientId, kvslib records this action to mark the end of a client session, corresponding to the successful completion of KVS.Close(). For each ClientId corresponding to a running client, the tracing log as a whole should contain this action exactly once, as part of the kvslib trace. This action should happen-after all other actions reported by a given clientId.
For each client, OpIds are unique: Any two traces from the same client that contain opIds must have distinct opId values.
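To connect these constraints back to the kvslib API, here is a sketch of the kvslib receive side of a Put, extending the KVS sketch shown earlier: when the frontend's reply arrives, kvslib resumes the operation's trace from the returned token, records KvslibPutResult, and only then delivers the ResultStruct on the notify channel. PutReply is the hypothetical reply struct sketched earlier, and setting result to nil on failure is an assumption, not something this spec mandates.

// deliverPutResult runs once the frontend's reply is in hand, typically from
// a per-operation goroutine so that Put itself stays non-blocking.
func (d *KVS) deliverPutResult(tracer *tracing.Tracer, opId uint32,
    value string, reply PutReply) {
    trace := tracer.ReceiveToken(reply.RetToken) // resume this operation's trace
    trace.RecordAction(KvslibPutResult{OpId: opId, Err: reply.StorageFail})

    result := &value
    if reply.StorageFail {
        result = nil // assumption: failed puts deliver no value
    }
    d.notifyCh <- ResultStruct{opId: opId, storageFail: reply.StorageFail, result: result}
}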

Actions to be recorded by frontend

  • FrontEndStorageStarted{}: Denotes that the frontend has detected the storage node as being started or restarted. This action should only be recorded in the frontend trace.
    • This action should alternate with FrontEndStorageFailed. No two FrontEndStorageStarted should occur without an intermediate FrontEndStorageFailed.
    • This action should be recorded only if the frontend has successfully executed an RPC against the storage node.
    • This action must be recorded before any FrontEndPutResult(err=false) action.
  • FrontEndStorageFailed{}: Denotes that the frontend has detected the storage node as failed. This action should only be recorded in the frontend trace.
    • This action should alternate with FrontEndStorageStarted. No two FrontEndStorageFailed actions should occur without an intermediate FrontEndStorageStarted.
    • This action should be recorded if the frontend has (1) failed to successfully execute an RPC against the storage node, and (2) has re-attempted the failed RPC after a timeout, and (3) the re-attempted RPC has also failed.
    • This action must be recorded before any FrontEndPutResult{Err: true} or FrontEndGetResult{Err: true} action. Additionally, there must be no FrontEndStorageStarted{} that happens-after this FrontEndStorageFailed{} and happens-before the FrontEndPut/GetResult{Err: true}.
  • FrontEndPut{key,value}: In a given trace and with a key & value pair, frontend records this action just after receiving a Put operation from a client.
    • A trace must contain this action exactly once after KvslibBegin and KvslibPut, should KvslibPut be present.
    • A trace must contain this action before FrontEndPutResult.
    • A trace must contain this action before any StoragePut and StorageSaveData, if they are present. If they are present, these subsequent actions must refer to the same key,value pair.
  • FrontEndPutResult{err}: The frontend must record this action after receiving Put results from storage node. Here err is a boolean, per storageFail in the ResultStruct.
    • A trace must contain this action exactly once after KvslibBegin and KvslibPut, should KvslibPut be present.
    • A trace must contain this action after a corresponding FrontEndPut.
    • A trace must contain this action after StoragePut and StorageSaveData.
    • This action should only be recorded with err set to true if the frontend has detected storage as failed and has (unsuccessfully) re-tried the operation. Recording this action indicates that kvslib will record a corresponding later (happens-after) KvslibPutResult with err set to true.
  • FrontEndGet{key}: In a given trace and with a key key, frontend records this action just after receiving a Get operation from a client.
    • A trace must contain this action exactly once after KvslibBegin and KvslibGet, should KvslibGet be present.
    • A trace must contain this action before FrontEndGetResult.
    • A trace must contain this action before any instances of StorageGet and StorageGetResult. Any of these subsequent actions should refer to the same key.
  • FrontEndGetResult{key,value,err}: frontend should record this action just after receiving the Get results from storage. Here value is the value of the key key. If err is true (operation failed), the value will be nil. value will have the actual value of key key only when err is false (operation succeeded).
    • A trace must contain this action exactly once after KvslibBegin and KvslibGet, should KvslibGet be present.
    • A trace must contain this action after corresponding FrontEndGet.
    • A trace must contain this action after StorageGet and StorageGetResult, and, if err is set to false, this action should report the same key,value pair as the StorageGetResult.
    • This action should only be recorded with err set to true if the frontend has detected storage as failed and has (unsuccessfully) re-tried the operation. Recording this action indicates that kvslib will record a corresponding later (happens-after) KvslibGetResult with err set to true.
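One way to respect the alternation rules above is to record these two actions only on transitions of a single status flag, in the frontend's own trace created from ftrace. A minimal sketch with illustrative names:

package frontend

import (
    "sync"

    "github.com/DistributedClocks/tracing"
)

// storageMonitor records FrontEndStorageStarted / FrontEndStorageFailed only
// on transitions, so the two actions strictly alternate in the frontend trace.
type storageMonitor struct {
    mu    sync.Mutex
    up    bool           // last status recorded for the storage node
    trace *tracing.Trace // the frontend's own trace, created from ftrace
}

func (m *storageMonitor) markStarted() {
    m.mu.Lock()
    defer m.mu.Unlock()
    if !m.up {
        m.trace.RecordAction(FrontEndStorageStarted{})
        m.up = true
    }
}

func (m *storageMonitor) markFailed() {
    m.mu.Lock()
    defer m.mu.Unlock()
    if m.up {
        m.trace.RecordAction(FrontEndStorageFailed{})
        m.up = false
    }
}

How you treat a failure observed before the first successful storage RPC is a design decision; the sketch above only tracks transitions.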

Actions to be recorded by storage

  • StorageLoadSuccess{state}: The storage records this action just after successful loading of Key-Value state from disk (or initializing this state, if it does not exist). It should report all key-value pairs that were loaded as state, which is a Go map.
    • This should be the first action recorded by the storage node on start (or restart) and it should be recorded only in the storage trace.
  • StoragePut{key,value}: For a given key-value pair key & value, the storage node records this action just after receiving a Put operation from frontend.
    • A trace must contain this action between zero and two times, after KvslibBegin and KvslibPut, should KvslibPut be present. It may only appear zero times if the storage node is not running or has failed during the put operation; it may only appear more than once if the two occurrences are separated by a StorageLoadSuccess.
    • A trace must contain this action after FrontEndPut and before FrontEndPutResult.
    • Any instances of this action in a trace must happen-before at least one instance of StorageSaveData in the same trace.
  • StorageSaveData{key,value}: The storage records this action just after durably writing the put's value to disk (adding or modifying the corresponding key).
    • A trace must contain this action between zero and two times, after KvslibBegin and KvslibPut, should KvslibPut be present. It may only appear zero times if the storage node is not running or has failed during the put operation; it may only appear more than once if the two occurrences are separated by a StorageLoadSuccess.
    • A trace must contain this action after FrontEndPut and before respective FrontEndPutResult.
    • A trace must contain this action after some instance of StorageLoadSuccess, and this action must happen-after at least one instance of StoragePut in the same trace.
    • Effect: across all traces, all instances of StorageGetResult for which this action is the latest relevant store action must report the same key,value pair.
  • StorageGet{key}: For a given key key, the storage node records this action just after receiving a Get operation from frontend.
    • A trace must contain this action between zero and two times, after KvslibBegin and KvslibGet, should KvslibGet be present. It may only appear zero times if the storage node is not running or has failed during the get operation; it may only appear more than once if the two occurrences are separated by a StorageLoadSuccess.
    • A trace must contain this action after FrontEndGet and before FrontEndGetResult.
    • A trace must contain this action after some instance of StorageLoadSuccess, and this action must happen-before at least one instance of StorageGetResult in the same trace.
  • StorageGetResult{key,value}: The storage records this action just after a Get on the in-memory key-value store. Here value is the value of the key key. If key does not exist in the key-value map, then value must be nil. Otherwise, value will have the actual value of key key.
    • A trace must contain this action between zero and two times, after KvslibBegin and KvslibGet, should KvslibGet be present. It may only appear zero times if the storage node is not running or has failed during the get operation; it may only appear more than once if the two occurrences are separated by a StorageLoadSuccess.
    • Any instance of this action must report the same value for the given key as the latest relevant store action associated with the same key.
    • A trace must contain this action after FrontEndGet and before FrontEndGetResult.
    • A trace must contain this action after some instance of StorageLoadSuccess, and this action must happen-after at least one instance of StorageGet in the same trace.
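Putting the storage-side actions together with the earlier disk sketches, the code below shows one order in which a storage node might record them. Names are illustrative, the RPC plumbing is elided, and the action struct field names are assumptions; the starter's definitions take precedence.

package storage

import "github.com/DistributedClocks/tracing"

// onStart records StorageLoadSuccess in the storage node's own trace
// (created from strace) once the on-disk state has been recovered or
// initialized, and before any requests are served.
func onStart(storageTrace *tracing.Trace, diskPath string) (map[string]string, error) {
    state, err := loadOrInitState(diskPath) // sketched earlier
    if err != nil {
        return nil, err
    }
    storageTrace.RecordAction(StorageLoadSuccess{State: state})
    return state, nil
}

// handlePut records StoragePut before touching the disk and StorageSaveData
// only after the write is durable, all in the client's trace resumed from
// the RPC's tracing token.
func handlePut(trace *tracing.Trace, key, value string,
    persist func(k, v string) error) error {
    trace.RecordAction(StoragePut{Key: key, Value: value})
    if err := persist(key, value); err != nil {
        return err
    }
    trace.RecordAction(StorageSaveData{Key: key, Value: value})
    return nil
}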

Assumptions you can and cannot make

  • You can assume that the frontend will start first, followed by the storage node, followed by some number of clients. (But, remember that the storage node may fail and later restart at any time and any number of times.)
  • Except for the frontend retry value of storageTimeout seconds, you should not make any timing assumptions about how long RPCs/communication may take over the network (i.e., assume an asynchronous environment).

Implementation notes

The provided starter code contains four applications. Your task is to implement the frontend app, the storage app, and the kvslib that will compile with a client application. Like A3 and A4, A5 also has a cmd/ directory which contains each application in a separate directory. The four applications are the client, the frontend, the storage node, and the tracing server.

While you should be making changes to cmd/frontend/main.go and cmd/storage/main.go, you may also need to make changes to the client application in cmd/client/main.go; however, all client implementation-related changes should occur in kvslib/kvslib.go.

The kvslib/ directory contains the kvslib package. It defines a KVS struct with stub methods you should implement. This exact API is required, and will be relied upon for grading. You may, however, add any fields you need to the KVS struct, as well as any private helper methods.

The config/ directory contains the configuration files for the client, storage, frontend, and tracing server.

The storage.go and frontend.go files contain the configuration and tracing action structs, while client.go implements a client based on the kvslib API.

We provide you with a Makefile to build all of the above applications.

The go.mod file contains dependency tracking metadata, including a dependency on https://github.com/DistributedClocks/tracing, the tracing library.

The go.sum file contains auto-generated metadata (checksums for module dependencies). You should not need to touch or understand this file.

Testing

We will release some smoke tests for you to test your tracing log files. As before, we recommend that you test your system across a variety of deployments. One of the best ways to test your system is to use the tracing output and assert that events are occurring in the expected order.

Implementation requirements

  • The client code must be runnable on CS ugrad machines and be compatible with Go version 1.15.6.
  • Your code must be compatible with the given APIs, and run successfully against the provided initial code for client, frontend, storage nodes.
  • Your solution can only use standard library Go packages.
  • Your solution code must be Gofmt'd using gofmt.

Handin instructions

As with previous assignments, you will use your personal repository for this assignment under the CPSC416-2020W-T2 org. Your repository will look like cpsc416_a5_USERNAME1_USERNAME2.

We will make the starter code available as part of your handin repository. Keep the provided file layout at the top-level in your solution. You may include additional files, but we do not expect this will be necessary. Do not reference any additional libraries in your solution.

Rough grading scheme

Your code must not change the API, tracing structs and applications' configurations and command-line arguments. Your code must work on ugrad servers. If any of these are violated, your mark for this assignment is 0. This is true regardless of how many characters had to be changed to make your solution compile, and regardless of how well your solution works with a different API or on a different machine.

Both partners in a team will receive the same assignment mark. That is, we will not be doing any accounting of who contributed what to the team's solution.

The high-level A5 mark breakdown looks like this:

  • 20% : Non-concurrent put/get requests without faults
  • 30% : Concurrent put/get requests without faults
  • 30% : Correct failure semantics/recovery at storage node
  • 10% : Correct failure semantics in kvslib and frontend
  • 10% : Recovery in complex fault scenarios
Please make sure that your code is structured according to our instructions. We will deduct 10% if we have to move, rename, or edit files in your repository to make your project work with our autograding scripts.

Performance bonuses. Do you want to earn more points? It's easy: just make your highly concurrent, fault-tolerant, multi-node system fast. How fast? You have to be faster than 90% of the class. That's only 126 other students, or 63 other A5 groups. There are four performance bonuses:

  • Get throughput (5% on top of A5 mark): what's the highest concurrent get request rate that your system can sustain from multiple concurrent clients? Assume a 100% get workload.
  • Put throughput (5% on top of A5 mark): what's the highest concurrent put request rate that your system can sustain from multiple concurrent clients? Assume a 100% put workload.
  • P99 get latency (5% on top of A5 mark): what's your system's 99th percentile end-to-end get request response time?
  • P99 put latency (5% on top of A5 mark): what's your system's 99th percentile end-to-end put request response time?

Everyone's solutions will be evaluated for a bonus. To qualify, your solution must be able to take nil values for tracer/trace instances in the API. That is, we will only run performance tests on your system if your system can disable tracing (not crash with nil values for tracer/trace).

To earn a bonus you must respect the other constraints: (1) no caching of key-value state in the kvslib or in the frontend, (2) no skipping of fault tolerance logic, (3) no assumptions about when the storage node fails, (4) no assumptions about network delay/timing, (5) no changes to APIs or configuration files.

We will not release the performance tests we will use. We will also not release the distribution of key/value sizes nor their pattern. We will also not release our definition of sustain for throughput, though you can think of it as the max request rate that your system achieves before hitting a knee in the performance curve.

Extra credit: 2% of final course mark

An earlier diagram depicted a case where the storage state is inconsistent with what the client observes. Specifically, the system is allowed to fail a client operation due to storage node failure, even though, internally, the operation succeeded. On storage restart this inconsistency becomes visible to the client.

To earn this extra credit you must eliminate this inconsistency problem. You must do so while respecting the other constraints: (1) no caching of key-value state in the kvslib or in the frontend, (2) no arbitrary delaying of client operations (clients should observe storage node failures), (3) no assumptions about when the storage node fails, (4) no assumptions about network delay/timing, (5) no changes to APIs or configuration files, etc.

You must explicitly ask for extra credit evaluation. Please post a private piazza post for this. The extra credit will be marked in a three-step process. First, you must compose and share a short document that reviews your solution. Second, you must schedule to meet with Ivan and do a code review to illustrate how your solution works. Third, at the end of your code review session, you should do a demo that illustrates your solution in practice.



Make sure to follow the course collaboration policy and refer to the submission instructions that detail how to submit your solution.

Note that this is a group assignment requiring two people. We strongly recommend that you work in a team. Please find a partner using post @5 on Piazza.