MPI-SCTP
Using the Stream Control Transmission Protocol (SCTP) for parallel
programs
written using the Message Passing Interface (MPI)
Goal:
This page documents our experiences with incorporating
SCTP and its features
into MPI, focusing particularly on the execution of MPI over a network. We
hope that our documented experiences here can benefit and interest
not only people directly
interested in MPI, but also developers of other network applications curious
about SCTP.
Description
TCP has always been the de facto transport protocol when network applications
are run. This began initially with local area networks and for various
network applications, has continued in wide area networks characterized often
by higher latency/loss links. Here, TCP has been tuned extensively in order
to maximize throughput. The same is true of parallel applications where
various socket settings and sysctl settings have been appropriately
tweaked.
Previous work using SCTP under high latency/loss links has shown drastic
increases in performance over TCP. Initially developed as a solution to
signal processing, SCTP has been shown to be useful in other contexts as
well such as
FTP,
HTTP,
satellite networks, etc. The question became, why
don't we use SCTP to execute latency tolerant parallel program over the
Internet, itself having high latency/loss?
SCTP
SCTP is a transport layer protocol comparable to TCP and UDP. It takes aspects
from each of these and also adds its own features. SCTP has been incorporated
into all major operating systems yet some have asked,
"Why is SCTP needed given TCP and UDP are widely available"? Others have
answered this.
Tutorials (
1,
2)
and RFCs
(
3286,
2960)
exist that fully outline the benefits of the protocol.
Further, the book
Stream Control Transmission Protocol (SCTP): A Reference
offers more thorough background about SCTP. Here, we do not intend
to fully repeat what these accomplish but try to discuss our own experiences
assuming a background of the protocol exists.
Why did we envision SCTP being a benefit for MPI? Several reasons jumped out
at first. By nature, MPI passes messages, so TCP implementations require
additional framing within the middleware. SCTP is message-based so the thought
was that this framing could be off-loaded to the transport protocol. The same
could be accomplished using UDP but unreliably.
Some of SCTP's features make it attractive for use over WANs. First off, it's
multihoming and multistreaming features make it less susceptible to loss
and latencies, traits often characteristic in WANs. Additionally, SCTP is
overall more secure. For example, its four-step initiation sequence makes
it avoid TCP SYN-like DoS attacks.
In the MPI community, attempts for increased fault tolerance and bandwidth
utilization have been researched. Most notably there has been the work by the
OpenMPI community. OpenMPI provides
an approach called
TEG
that manages a node's interfaces in
the MPI middleware. It stripes data over each available path, and collects
the data on the receiving side for reassembly. In addition to providing
fault tolerance, this approach maximizes the bandwidth utilization on the
connection. SCTP's multihoming feature itself provides similar fault
tolerance in the transport protocol. Each endpoint is represented by a set
of local addresses, and in the case of network failure, SCTP multihoming
provides fast path failover. TEG's ability to stripe across multiple paths
is also reminiscent to ongoing research in the SCTP community, namely
CMT, or
Concurrent Multipath Transport.
This work incorporates data striping into the transport layer. TEG has the
inherent advantage of allowing data striping across different protocols
(Infiniband, TCP, Myrinet, etc.) but CMT may have a performance advantage of
being implemented in kernel space. Unfortunately, no direct performance
comparisons could yet be made between the SCTP's kernel space and TEG's user
space approaches at the time of this writing because no public release of
CMT was available.
So for us the general theme for using SCTP was :
SCTP let's us off-load useful functionality from the MPI
middleware down into the transport layer.
Perhaps the same can be said of your own network application. For us, using
SCTP opens up the door for the execution of parallel programs in distributive
environments.
Learning about the SCTP API
We first learned about SCTP in the latest edition of the Steven's book
Unix Network Programming, 3rd Edition.
It points out that network applications using SCTP can be coded using two
API styles, 1) the "TCP-like" one-to-one style as well as 2) the "UDP-like"
one-to-many style. We used this as valuable resource for learning about
the two coding styles. Another invaluable source was reading the
latest SCTP API draft off of the IETF website.
SCTP Implementations
Several implementations of SCTP exist. Some are fully listed off of
this link.
The original SCTP book by Stewart and Xie provided an example user space
implementation. User space implementations need access to raw sockets, so
either applications must be run as root or the kernel must have raw IP
access for users compiled in. Recently, the authors' work has been more
focused on
kernel implementations, for example, we know that Randall Stewart's work
for a kernel-based implementation of SCTP has primarily been on the BSD
KAME stack. We have also
worked extensively with the Linux
LKSCTP as well.
Experiences with SCTP
Generally, performance is VERY stack dependent.
In our experience, the KAME.net BSD stack has well out-performed the
LKSCTP Linux stack, although performance improvements are expected to come
for all stacks with time as the protocol is still quite young. Through
conversations with members of the SCTP community at the 64th IETF in
Vancouver, we learned that Sun Solaris 10 and Mac OS 10 ship an SCTP
implementation out-of-the-box, and that the performance results of these
implementations are on par with BSD.
Recent
tests have compared the BSD/Mac implementation to Sun and Linux
and these trends were confirmed.
MPI
We started out by focusing on the open source
LAM/MPI implementation of MPI due to
its modularity and helpful users' community. We extensively studied
the TCP module of the Request Progression Interface (RPI). After
understanding this in addition to the use of SCTP, we undertook the design
process of incorporating SCTP in the LAM RPI.
Papers
2005
Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP-based Middleware for MPI
in Wide-Area Networks. In Proceedings of the IEEE Conference on
Communication Networks and Services Research (CNSR2005), Halifax, CANADA,
May 2005. Full text of the paper is available at IEEE Xplore (Click here)
This paper gives an overview of SCTP and MPI. It then carefully describes
the internals of the LAM/MPI TCP RPI implementation. Bearing this in mind, it
then describes how SCTP and its various features were iteratively incorporated
in the LAM/MPI TCP RPI. This paper focuses on the design of these various
iterations and describes how each iteration benefits from its additions.
Humaira Kamal, Brad Penoff, and Alan Wagner. Evaluating Transport Level
Protocols for MPI in the Internet. In Proceedings of the International
Conference on Communications in Computing (CIC 2005), Las Vegas, Nevada USA,
June 2005.
This paper focuses on the execution of the popular
NAS Parallel Benchmarks
on networks having high latency/loss. We examine TCP NewReno and TCP SACK
looking at the effects on each of various socket and sysctl settings
such as TCP_NODELAY and the use of delayed acknowledgments.
(PDF | talk)
Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP versus TCP for MPI.
Proceedings of Supercomputing 2005 (SC2005), Seattle, Washington USA, November 2005. Best Student Paper Award finalist (top 4 of 60+).
Here, we focus on the implementation issues and resulting performance
results of our implementation of MPI over SCTP, making frequent comparisons
to its TCP counterpart. We illustrate the benefits SCTP offers
through emulation using
Dummynet.
2006
(PDF)
Humaira Kamal, Brad Penoff, Mike Tsai, Edith Vong, Alan Wagner.
Using SCTP to hide latency in MPI programs. Accepted to
HCW 2006 and to appear in
the Proceedings for IPDPS 2006, Rhodes, GREECE, April 2006.
This paper uses our SCTP-based middleware as a product and shows the benefits
in using tags to define independent streams in a task farm. Benefits are
shown fitting the popular mpiBlast program as well as other
real applications into our task farm model.
(PDF)
Brad Penoff and Alan Wagner.
Towards MPI progression layer elimination with TCP and SCTP.
Accepted to
HIPS 2006 and to appear
in the Proceedings for IPDPS 2006, Rhodes, GREECE, April 2006.
This paper reviews existing MPI middleware designs and then presents a
TCP-based design where the message
progression layer is fully eliminated from the MPI middleware. Its
functionality is either pulled up into the MPI library or pushed down
onto the protocol stack resulting in a thinner design that can more easily
leverage network protocol improvements and advancements in commodity networking. Later, this
design is compared to our SCTP-based design. It is shown that the
SCTP-based design thins the message progression layer, however full
elimination is not possible using SCTP one-to-many socket style unless
additional functionality is provided by the SCTP socket API.
2007
(PDF)
Brad Penoff, Mike Tsai, Janardhan Iyengar, and Alan Wagner.
Using CMT in SCTP-based MPI to exploit multiple interfaces in
cluster nodes.
In the Proceedings of
EuroPVM/MPI 2007,
Paris, FRANCE, Sept 2007.
2008
(PDF|talk)
Mike Tsai, Brad Penoff, and Alan Wagner.
A Hybrid MPI Design using SCTP and iWARP.
In Communication Architecture for Clusters (CAC): Proceedings of the 2008
IEEE International Parallel and Distributed Processing Symposium (IPDPS),
Miami, Florida, USA, April 2008.
2009
(PDF
|talk)
Brad Penoff, Alan Wagner, Michael Tuexen, and Irene Ruengeler.
MPI-NeTSim: A network simulation module for MPI.
In the 15th IEEE International Conference on Parallel and Distributed Systems (ICPADS'09)
, Shenzhen, CHINA, December 2009.
2010
(DOI)
Brad Penoff, Humaira Kamal, Alan Wagner, Mike Tsai, Karol Mroz, and Janardhan Iyengar.
Employing Transport Layer Multi-railing in Cluster Networks.
In the Journal of Parallel and Distributed Computing (JPDC)
Volume 70, Issue 3, March 2010, Pages 259-269.
2011
(PDF (to appear))
Irene Ruengeler, Michael Tuexen, Brad Penoff, and Alan Wagner.
A New Fast Algorithm for Connecting the INET Simulation Framework to Applications in Real-time.
In the fourth International ICST Conference on Simulation Tools and Techniques (SIMUTools 2011),
March 2011.
2012
(PDF)
Irene Ruengeler, Michael Tuexen, Brad Penoff, and Alan Wagner.
Portable and Performant Userspace SCTP Stack.
In the IEEE International Conference on Computer Communication Networks (ICCCN 2012),
July 2012.
Presentations
CNSR2005, Halifax, Canada - May 2005
(PowerPoint)
Ohio State visit, Columbus, Ohio - Oct 24, 2005
(PowerPoint)
SC|05, Seattle, Washington, USA - Nov 16, 2005
(PowerPoint)
VanHPC meeting, Vancouver, BC, CANADA - March 15, 2006 (PDF, PowerPoint)
HCW 2006, Rhodes, GREECE - April 25, 2006 (PowerPoint)
HIPS 2006, Rhodes, GREECE - April 25, 2006 (PowerPoint)
Argonne National Laboratory, Chicago, Illinois, USA - September 7, 2006
Google, Seattle, Washington, USA - June 23, 2007 (PDF)(Google Video|YouTube|embedded)
Euro PVM/MPI 2007, Paris, FRANCE - October 1, 2007
CAC 2008, Miami, Florida, USA - April 14, 2008 (PowerPoint)
ICPADS 2009, Shenzhen, CHINA - December 11, 2009 (PDF)
Technical Documents
Description of LAM TCP RPI Module
(PDF)
Future Events
ICCCN 2012
Past Events
SCTP Interop. Event (July 30st- August 4th, 2006)
(WEBSITE)
The standard MPICH2
1.0.5 release includes our SCTP channel for MPICH2's ch3 device.
Software
Open MPI
Open MPI On Nov 13, 2007, the initial
SCTP BTL was committed to ompi-trunk during
changeset
16723.
MPICH2
MPICH2 1.0.5 will
include an SCTP channel for MPICH2's ch3 device. Directions for using and
compiling this are provided in the README.
LAM/MPI
Our initial prototype discussed in our SC|05 paper was implemented within
LAM/MPI. Our modified LAM/MPI used in this paper is now available:
LAM/MPI 7.0.6 with SCTP - Email us for the tarball!
LAM/MPI when run using the TCP RPI provides concurrency at the process level
since each pair of processes have a socket on each side associated with their
connection. TCP provides a full ordering on these connections, however this
is more strict than those required by MPI. MPI requires that messages
passed with the same tag, rank, and context (TRC) maintain ordering. Our
implementation provides this TRC level concurrency.
Although contrived, the following simple example program illustrates how
head-of-line blocking can occur in TCP-based RPIs, something our SCTP-based
middleware avoids. The same communication pattern of this latency
tolerant program could conceivably be present in real applications. This
program is a way to implement Figure 5 in our SC|05 paper.
waitany - download the example program here
Future Work
We want to have several topics that we want to investigate in the future. A
small subset of them include:
- incorporating SCTP into OpenMPI
- testing SCTP CMT vs. OpenMPI TEG vs. channel bonding for data striping in
standard MPI programs
- becoming a customer of our own product and work on modifying real
applications to benefit from SCTP's benefits
- continue to look for opportunities to take advantage of standard IP
transport protocols for MPI
- looking at the Sun Solaris 10 SCTP implementation and test its performance
- looking at MAC OS 10 SCTP implementation and test its performance
- test on real wide area networks as opposed to emulation
People
Humaira Kamal (web)
Brad Penoff (web)
Alan Wagner (web)
Mike Yao Chen Tsai
Edith Vong
Contact
If you have questions regarding anything, feel free to email us at mpi hyphen sctp at cs dot ubc dot ca.
Last Updated: Sept 1, 2008