Failure Transparency in Remote Procedure Calls
Remote procedure call (RPC) is a communication abstraction widely used in distributed programs. The general premise entwined in existing approaches to handle machine and communication failures during RPC is that the applications which interface to the RPC layer cannot tolerate the failures. The premise manifests as a top level constraint on the failure recovery algorithms used in the RPC layer in these approaches. However, our premise is that applications can tolerate certain types of failures under certain situations. This may in turn relax the top level constraint on failure recovery algorithms and allow exploiting the inherent tolerance of applications to failures in a systematic way to simplify failure recovery. Motivated by the premise. The paper presents a model of RPC. The model reflects certain generic properties of the application layer that may be exploited by the RPC layer during failure recovery. Based on the model a new technique of adopting orphans caused by failures is described. The technique minimizes Ihe rollback which may be required in orphan killing techniques. Algorithmic details of the adoption technique are described followed by a quantitative analysis. The model has been implemented as a prototype on a local area network. The simplicity and generality of the failure recovery renders the RPC model useful in distributed systems. particularly those that are large and heterogeneous and hence have complex failure modes.