A list of recent publications.
A significant number of parallel applications are implemented using MPI (Message Passing Interface), and several existing approaches focus on their verification. However, these approaches typically work with complete applications, and fixing any undesired behaviour at this late stage of development is difficult and time-consuming. To address this problem, we present a lightweight formal approach that helps developers build safety into MPI applications from the early stages of program development. Our approach consists of a methodology that incorporates verification into the program development process. We provide tools that hide the more difficult formal aspects from developers, making it possible to verify properties such as freedom from deadlock and to automatically generate partial code skeletons. We evaluate our approach with respect to its ability and efficiency in detecting deadlocks.
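The deadlock-freedom property mentioned above can be illustrated with a toy wait-for-graph cycle check. This is a simplified stand-in for the paper's verification tooling, not the actual approach; the function name and graph encoding are invented for the sketch:

```python
# Toy wait-for-graph deadlock check (illustrative only; the paper's
# tools verify real MPI programs, not this simplified model).
# `wait_for` maps a process rank to the rank it is blocked waiting on.

def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle."""
    for start in wait_for:
        seen = set()
        cur = start
        while cur in wait_for:
            if cur in seen:
                return True
            seen.add(cur)
            cur = wait_for[cur]
    return False

# Two processes each blocked in a send to the other: classic deadlock.
print(has_deadlock({0: 1, 1: 0}))  # True
# A chain 0 -> 1 -> 2 where rank 2 is not blocked: no cycle.
print(has_deadlock({0: 1, 1: 2}))  # False
```

A cycle here corresponds to the classic MPI pitfall of two ranks each calling a blocking `MPI_Send` to the other before posting a receive.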
In this paper we present a distributed, in-memory, message-passing implementation of a dynamic ordered dictionary structure. The structure is based on a distributed, fine-grain implementation of a skip list that can scale across a cluster of multicore machines. We present a service-oriented approach to the design of distributed data structures in MPI, where the skip list elements are active processes that have control over the list operations. Our implementation makes use of the unique features of Fine-Grain MPI and introduces new algorithms and techniques to achieve scalable performance on a cluster of multicore machines. We introduce shortcuts, a mechanism used for service discovery, as an optimisation technique to trade off consistency semantics for performance. Our implementation includes a novel skip-list-based range query operation. Range queries are implemented in a way that parallelises the operation and takes advantage of the recursive properties of the skip list structure. We report the performance of the skip list on a medium-sized cluster with two hundred cores and show that it achieves scalable performance.
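The core structure can be sketched sequentially. The following is a minimal single-node skip list with a range query that descends to the bottom level and walks the ordered chain; it is a stand-in for the paper's distributed, service-oriented version (where elements are MPI processes and the query recursion is parallelised), and all names are invented for the sketch:

```python
import random

# Minimal sequential skip list sketch (single-node stand-in for the
# distributed implementation described in the abstract).
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # next pointer per level

class SkipList:
    MAX_LEVEL = 4

    def __init__(self):
        self.head = Node(float("-inf"), self.MAX_LEVEL)  # sentinel
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        cur = self.head
        for i in range(self.level, -1, -1):
            while cur.forward[i] and cur.forward[i].key < key:
                cur = cur.forward[i]
            update[i] = cur
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl + 1):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range_query(self, lo, hi):
        """Collect keys in [lo, hi]: descend to level 0 at lo, then walk."""
        cur = self.head
        for i in range(self.level, -1, -1):
            while cur.forward[i] and cur.forward[i].key < lo:
                cur = cur.forward[i]
        cur = cur.forward[0]
        out = []
        while cur and cur.key <= hi:
            out.append(cur.key)
            cur = cur.forward[0]
        return out

sl = SkipList()
for k in (3, 1, 7, 5, 9):
    sl.insert(k)
print(sl.range_query(2, 8))  # [3, 5, 7]
```

In the distributed setting, segments of the level-0 chain live in separate fine-grain MPI processes, so the walk becomes message forwarding and independent subranges can be collected in parallel.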
In this paper we introduce a service-oriented approach to the design of distributed data structures for MPI. Using this approach we present the design of an ordered linked-list structure. The implementation relies on Fine-Grain MPI (FG-MPI) and its support for exposing fine-grain concurrency. We describe the implementation of the service and show how to compose and map it onto a cluster. We experiment with the service to show how its behaviour can be adjusted to match the application and the underlying characteristics of the machine. The advantage of a service-oriented approach is that it enforces low coupling between components and high cohesion within them. As a service, the ordered linked-list structure can be easily composed with application code and, more generally, it illustrates how complex data structures can be added to message-passing libraries and languages.
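The service idea — each list element as an active entity that handles and forwards operations — can be sketched sequentially. Method calls below model the messages that FG-MPI processes would exchange; the class and method names are invented for this illustration:

```python
# Sketch of the element-as-active-process idea: each element owns a key
# and forwards operations down the list (a sequential model of message
# forwarding between fine-grain MPI processes; names are invented).

class Element:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt

    def find(self, key):
        # "Handle" a find request: answer locally or forward it on.
        if self.key == key:
            return True
        if self.next is None or self.next.key > key:
            return False
        return self.next.find(key)

    def insert(self, key):
        # Forward until the correct position, then splice in a new element.
        if self.next is None or self.next.key > key:
            if self.key != key:
                self.next = Element(key, self.next)
        else:
            self.next.insert(key)

head = Element(float("-inf"))  # sentinel head element
for k in (5, 2, 9):
    head.insert(k)
print(head.find(5), head.find(3))  # True False
```

Because each operation is handled locally and forwarded, the structure decomposes naturally into processes that can be mapped across OS-processes and nodes, which is what keeps the coupling between service and application low.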
MPI implementations typically equate an MPI process with an OS-process, resulting in a coarse-grain programming model where MPI processes are bound to the physical cores. Fine-Grain MPI (FG-MPI) extends the MPICH2 implementation of MPI with an integrated runtime system that allows multiple MPI processes to execute concurrently inside an OS-process. FG-MPI's integrated approach makes it possible to add more concurrency than available parallelism, while minimizing the overheads related to context switches, scheduling and synchronization. In this paper we evaluate the benefits of this added concurrency for cache awareness and message size, and show that performance gains are possible by using FG-MPI to adjust the grain size of a program to better fit the cache and to exploit the potential advantages of passing smaller rather than larger messages. We evaluate the use of FG-MPI on the complete set of the NAS parallel benchmarks over large problem sizes, where we show significant performance improvement (20%-30%) for three of the eight benchmarks. We discuss the characteristics of the benchmarks with regard to the trade-offs between the added costs and benefits.
Fine-grain MPI (FG-MPI) extends the execution model of MPI to allow for interleaved execution of multiple concurrent MPI processes inside an OS-process. It provides a runtime that is integrated into the MPICH2 middleware and uses lightweight coroutines to implement an MPI-aware scheduler. In this paper we describe the FG-MPI runtime system and discuss the main design issues in its implementation. FG-MPI enables the expression of function-level parallelism, which, along with the runtime scheduler, can be used to simplify MPI programming and achieve performance without adding complexity to the program. As an example, we use FG-MPI to re-structure a typical use of non-blocking communication and show that the integrated scheduler relieves the programmer from scheduling computation and communication inside the application, moving this performance concern out of the program specification and into the runtime.
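The restructuring described above — letting a coroutine scheduler interleave communication waits with other work instead of hand-managing non-blocking requests — can be sketched with Python's `asyncio` as a stand-in for FG-MPI's MPI-aware coroutine scheduler. The functions below are invented for the illustration:

```python
import asyncio

# Coroutine sketch of scheduler-driven overlap (asyncio stands in for
# FG-MPI's MPI-aware coroutine scheduler; all names are illustrative).

async def fake_recv(delay, value):
    # Stand-in for a blocking MPI receive: the coroutine yields to the
    # scheduler instead of the application spinning on MPI_Test.
    await asyncio.sleep(delay)
    return value

async def worker(rank):
    # Each "MPI process" posts a receive; while it waits, the scheduler
    # runs the other workers, overlapping their communication.
    data = await fake_recv(0.01, rank * 10)
    return data + 1

async def main():
    return await asyncio.gather(*(worker(r) for r in range(4)))

print(asyncio.run(main()))  # [1, 11, 21, 31]
```

The point of the sketch is that no worker contains any explicit progress or scheduling code — exactly the separation FG-MPI provides, where the interleaving lives in the runtime rather than in the program.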
Building clusters from commodity off-the-shelf parts is a well-established technique for building inexpensive medium- to large-size computing clusters. Many commodity mid-range motherboards come with multiple Gigabit Ethernet interfaces, and the low cost per port for Gigabit Ethernet makes switches inexpensive as well. Our objective in this work is to take advantage of multiple inexpensive Gigabit network cards and Ethernet switches to enhance the communication and reliability performance of a cluster. Unlike previous approaches that take advantage of multiple network connections for multi-railing, we consider CMT (Concurrent Multipath Transfer), which extends SCTP (Stream Control Transmission Protocol), a transport protocol developed by the IETF, to make use of the multiple paths that exist between two hosts. In this work, we explore the applicability of CMT in the transport layer of the network stack to high-performance computing environments. We develop SCTP-based MPI (Message Passing Interface) middleware for MPICH2 and Open MPI, and evaluate the reliability and communication performance of the system. Using Open MPI with support for message striping over multiple paths at the middleware level, we compare supporting multi-railing in the middleware versus at the transport layer.
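The striping idea common to both approaches — splitting one message into chunks spread over several paths and reassembling them in order at the receiver — can be modelled in a few lines. This is an illustrative model only: real CMT does this inside the SCTP transport layer, and the chunking scheme and function names here are invented:

```python
# Toy round-robin striping of a message across k "paths", with in-order
# reassembly at the receiver (illustrative model of multi-railing /
# CMT-style striping; the chunking scheme is invented for the sketch).

def stripe(message: bytes, paths: int, chunk: int):
    """Split the message into chunks and assign them round-robin to paths."""
    rails = [[] for _ in range(paths)]
    for i in range(0, len(message), chunk):
        seq = i // chunk                      # sequence number per chunk
        rails[seq % paths].append((seq, message[i:i + chunk]))
    return rails

def reassemble(rails):
    """Merge the per-path chunk streams back into the original message."""
    chunks = sorted(c for rail in rails for c in rail)
    return b"".join(data for _, data in chunks)

msg = b"abcdefghij"
rails = stripe(msg, paths=2, chunk=3)
print(reassemble(rails) == msg)  # True
```

The trade-off the paper studies maps onto where this logic lives: in the middleware (as in the Open MPI striping configuration) the MPI layer does the splitting and ordering, while with CMT the transport layer handles sequencing and loss recovery transparently.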