Mini-tutorial on Time-Space diagrams

This page will introduce you to a way of visualizing an execution of a networked system. At the end of this mini-tutorial, you will answer a few questions to check your understanding of this technique.

One way to visualize an execution of a networked system is to draw a timeline for each host in the system (with time proceeding visually down) . Then, local events for a host can be plotted along the host's timeline, and communication events (i.e., message sends and receives) can be represented as arrows that connect the timelines of the communicating hosts at points of time when they communicated. This visualization is called a time-space diagram.

For example, the figure above shows a time-space diagram that represents the execution of a three-host system in which two clients concurrently buy tickets from a server. The two left-most timelines (grey vertical lines) depict the events executed at the clients -- client 0 and client 1. The third timeline shows events executed by the tickets server. An arrow between two host timelines denotes communication -- the event at the start of an arrow is a send message event, and the event that an arrow points to is a receive message event. In this execution the server has a single ticket, so one of the clients is able to purchase a ticket (it receives a sold message from the server), while the second client is unable to do so (it receives a sold-out response from the server). The only local (non-communication) event is the sold-ticket event at the ticket server.

Informally, the vertical space between events represents time. More formally, the time-space visualization captures the ordering of events in a distributed setting. Two events in the system are ordered if there is a path (along all the edges in the diagram -- both grey and black) from one event to the second, that never goes upward. For example, in the diagram the send of message buy at client 1 happens before the local sold-ticket event at the server. Two events are considered to execute concurrently if there is no such path between the two events. For example, in the diagram the send buy event at client 0 and send buy event at client 1 are concurrent, while the events send buy at client 1 and receive sold-out at client 0 are ordered.


Checking your understanding.

The questions below are intended to test your understanding of time-space diagrams described above. Each question has a single correct answer. You won't be able to move on until you answer these questions correctly.

For the time-space diagram above, is send buy at client 0 concurrent with receive sold at client 1?

Answer: Yes. There is no path connecting these two events, therefore they are concurrent.


If the network could arbitrarily delay some messages, could the server have received the buy message from client 0 before it received the buy message from client 1?

Answer: Yes. Because these two events are sent concurrently, network delay will determine when they are received at the server.



Debugging with time-space diagrams [tspace-saw version]

This page will first specify a networked system in English. Then, it will list a number of observed executions of the system as time-space diagrams. Finally, you will answer a few questions based on the English specification of the system and the time-space diagrams.

The stop-and-wait protocol.

The stop-and-wait (SAW) protocol implements reliable data communication between two hosts -- the sender and the receiver. However, the intermediate channel between the two hosts can lose messages. That is, if the sender sends a data in message m, then the receiver may or may not receive m on its end.

The SAW protocol implements a strategy that delivers data over lossy channels reliably. To accomplish this, the receiver notifies the sender whenever it receives a message by replying with an acknowledgement message. However, the reverse channel is also lossy, so the SAW protocol must be resilient to both loss of sender messages and loss of receiver acknowledgments. To overcome this, the hosts append a single bit to each message and its acknowledgement -- a bit of 0 is used for odd data transmissions (data that is transmitted 1st, 3rd, 5th, and so on), while bit 1 is used for even data transmissions (2nd, 4th, 6th, etc).

For example, assume that the sender needs to deliver a single data item x to the receiver. Since this is the first (odd) data transmission, the sender first sends a message x0 (data x with bit 0 appended to the end). Next, the sender waits for the receiver to reply with ack0 (acknowledgement of x0). If the sender does not receive ack0 after a certain timeout, then the sender assumes that x0 was lost and re-sends x0. It re-sends x0 until it receives ack0, after which point it considers x as having been received.

Now, assume that the sender needs to deliver two data items -- x and y. The sender can send x using the strategy above. But, next, when it transmits y, it will send y1 (data y, with bit 1 appended to the end), since y is a second (even) data transmission. The sender will then expect the receiver to respond with ack1 (acknowledgements of the last message with bit 1). If it receives ack0 (that is acknowledging the previous message, x0), it will ignore it. And, as before, the sender will retransmit y1 after a timeout.


Debugging task.

Based on the system description above and the observed time-space diagrams of the system below, answer the question below.



List all of the time-space diagrams above that you think do not conform to the description of the system above. What made you choose these diagrams?

Answer: D, E, F, H. In each of these four time-space diagrams the Sender host times out (local timeout event). The description mandates that the Sender re-send the previous message (x0 or y0), but in each of these diagrams the sender never re-sends the message.



Debugging with time-space diagrams [tspace-vol version]

This page will first specify a networked system in English. Then, it will list a number of observed executions of the system as time-space diagrams. Finally, you will answer a few questions based on the English specification of the system and the time-space diagrams.

The Voldemort distributed storage system.

Voldemort is a distributed storage systems that consists of some number of replica hosts, and one client host. The client can either associate a value with a key and store it at a replica, or retrieve a value from a replica based on a key. The client does this by issuing two kinds of commands -- put(key,value) (store value and associate it with key), and get(key) (retrieve a value associated with key). In response to a put or a get command, the client responds with an ack message.

When the Voldemort system has more than one replica, the client uses all of the replicas to store and retrieve data. This makes the systems more fault-tolerant -- if a replica fails, the client's data persists.

To store data at multiple replicas, the client issues a put command to each replica. It waits for the replica's ack response before issuing a put to the next replica. Once the client executed the put on all replicas, it considers the data to be stored.

Similarly, to retrieve data from multiple replicas, the client issues a get command to each replica. It waits for a replica to reply before issuing the next get. Once the client has retrieved all of the values, it decides to use the value that appears most frequently among the set of values that it has collected.


Debugging task.

Based on the system description above and the observed time-space diagrams of the system below, answer the question at the bottom.



List all of the time-space diagrams above that you think do not conform to the description of the system above. What made you choose these diagrams?

Answer: C, E. According to the description, after sending a put or a get request, the client must wait for an ack from a replica before sending the next request. These two time-space diagrams contradict this requirement.



Mini-tutorial on Communicating Finite State Machines

This page will introduce you to one way of modeling networked systems. At the end of this mini-tutorial, you will answer a few questions to check your understanding of this modeling formalism.

A communicating finite state machine (CFSM) is a model that depicts a fixed set of hosts. Hosts communicate with one another by sending and receiving messages on uni-directional channels. Each host in a CFSM is represented as a finite state machine. For example, the figure above shows a CFSM with two hosts -- Host A and Host B. These two hosts are connected with channels c and d.

In a host finite state machine, circles represent states (e.g., host A has two states -- a0 and a1). And, arrows depict transitions on events between states. The host machines execute discretely and asynchronously by taking transitions from one state to the next, starting from an initial state and terminating in the terminal state. The initial state is indicated by a floating arrow that points to it (e.g., a0 is the initial state at host A). The terminal state is indicated by two concentric circles (e.g., b0 is the terminal state at host B).

Hosts in a CFSM have two kinds of communication events that allow them to exchange messages. The send of a message m on channel c is expressed as c!m (the exclamation mark indicates a send). While the receive of a message m on channel c is indicated by c?m (the question mark indicates a receive).

For example, in the CFSM above, host A can send a message m to host B with the event c!m. This would transition host A from state a0 to state a1. Host B can receive m on channel c by transitioning from b0 to b1 (event c?m). Next, host A must stay in a1, waiting for an ack message from host B. Host B can reply with an ack on channel d -- d!ack, which would transition it from b1 to b0. Finally, host A can receive the ack and transition back to a0.

The above sequence is a valid execution of the system because it start with all hosts in initial states, terminates with all hosts in terminal states, and has legal intermediate transitions. This execution can be represented more compactly as the sequence "c!m,c?m,d!ack,d?ack". On the other hand, the execution "c?m,d!ack,d?ack" is not a valid execution (host B cannot receive message m since host A never sent it).


Checking your understanding.

The questions below are intended to test your understanding of the CFSM formalism described above. Each question has a single correct answer. You won't be able to move on until you answer these questions correctly.

For the CFSM above, is the execution "c!m,d!ack,d?ack" legal?

Answer: No. Host B cannot send an ack as its first event.


In the CFSM above, what is the maximum number of messages that channel c can contain at any point of time?

Answer: 1. There can never be more messages in channel c because Host A must first receive an ack before sending another messages on this channel. But, to reply with such an ack, Host B must first receive the outstanding message on channel c.



Debugging with a CFSM model [cfsm-vol version]

This page will first specify a networked system in English. Then, it will present a CFSM model that was derived from observed executions of the system. Finally, you will answer a few questions based on the English specification of the system and the CFSM model.

The Voldemort distributed storage system.

Voldemort is a distributed storage systems that consists of some number of replica hosts, and one client host. The client can either associate a value with a key and store it at a replica, or retrieve a value from a replica based on a key. The client does this by issuing two kinds of commands -- put(key,value) (store value and associate it with key), and get(key) (retrieve a value associated with key). In response to a put or a get command, the client responds with an ack message.

When the Voldemort system has more than one replica, the client uses all of the replicas to store and retrieve data. This makes the systems more fault-tolerant -- if a replica fails, the client's data persists.

To store data at multiple replicas, the client issues a put command to each replica. It waits for the replica's ack response before issuing a put to the next replica. Once the client executed the put on all replicas, it considers the data to be stored.

Similarly, to retrieve data from multiple replicas, the client issues a get command to each replica. It waits for a replica to reply before issuing the next get. Once the client has retrieved all of the values, it decides to use the value that appears most frequently among the set of values that it has collected.


Debugging task.

Based on the system description above and the CFSM model of the system that is constructed from observations of the system below, answer the question below.



How does the observed model differ from the intended system description?

Answer: According to the description, after sending a put or a get request, the client must wait for an ack from a replica before sending the next request. The CFSM model does not include this constraint. For example, the client can send two get requests back-to-back.



Debugging with a CFSM model [cfsm-saw version]

This page will first specify a networked system in English. Then, it will present a CFSM model that was derived from observed executions of the system. Finally, you will answer a few questions based on the English specification of the system and the CFSM model.

The stop-and-wait protocol.

The stop-and-wait (SAW) protocol implements reliable data communication between two hosts -- the sender and the receiver. However, the intermediate channel between the two hosts can lose messages. That is, if the sender sends a data in message m, then the receiver may or may not receive m on its end.

The SAW protocol implements a strategy that delivers data over lossy channels reliably. To accomplish this, the receiver notifies the sender whenever it receives a message by replying with an acknowledgement message. However, the reverse channel is also lossy, so the SAW protocol must be resilient to both loss of sender messages and loss of receiver acknowledgments. To overcome this, the hosts append a single bit to each message and its acknowledgement -- a bit of 0 is used for odd data transmissions (data that is transmitted 1st, 3rd, 5th, and so on), while bit 1 is used for even data transmissions (2nd, 4th, 6th, etc).

For example, assume that the sender needs to deliver a single data item x to the receiver. Since this is the first (odd) data transmission, the sender first sends a message x0 (data x with bit 0 appended to the end). Next, the sender waits for the receiver to reply with ack0 (acknowledgement of x0). If the sender does not receive ack0 after a certain timeout, then the sender assumes that x0 was lost and re-sends x0. It re-sends x0 until it receives ack0, after which point it considers x as having been received.

Now, assume that the sender needs to deliver two data items -- x and y. The sender can send x using the strategy above. But, next, when it transmits y, it will send y1 (data y, with bit 1 appended to the end), since y is a second (even) data transmission. The sender will then expect the receiver to respond with ack1 (acknowledgements of the last message with bit 1). If it receives ack0 (that is acknowledging the previous message, x0), it will ignore it. And, as before, the sender will retransmit y1 after a timeout.


Debugging task.

Based on the system description above and the CFSM model of the system that is constructed from observations of the system below, answer the question at the bottom.



How does the observed model differ from the intended system description?

Answer: There are two issues. (1) The states s1 and r1 should be terminal states. Otherwise, the sender would be obligated to send an even number of messages before terminating. (2) The timeout loops on states s1 and s3 do not conform to the specification. Instead, a timeout event in s1 should transition the sender to s0. Similarly, a timeout event in s3 should transition the sender to s2.