521 Assignments
Brad Bingham
Code: Download zip of all assignments here. Use "unzip assignments.zip" and then "make all" to compile. Assignments 1 through 6 are run with "mpirun -np [number of processors] a[assignment number].exe". Assignment 9 is run with "a9.exe [rules file]".
Assignments:
- Assignment 1 [5 marks]: The problem is sorting integers, specifically a randomly generated list of N numbers in the range [0, N-1] where N is a define constant in the .c file. This was chosen to use some simple MPI functions to solve a problem I understand well, and is easy to verify. The program sorts the list using P-1 slave processors to sort pieces of the list, and a master processor merges these sorted lists together to give the result. The final sorted list is printed. Writing this program helped me learned how to compile, link, and run MPI. With a large N, the single master processor will be a bottleneck. The algorithm could be made more efficient by merging small sorted lists in stages, possibly in a tree structure. Any algorithm where slaves are not left idle would likely be an improvement. This program can be executed on 2 or more processors.
- Assignment 2 [15 marks]: A graph coloring is an assignment of k colors to vertices in a graph such that no two adjacent vertices are the same color. The graph coloring decision problem is "Can graph G be colored using k colors?"; if the answer is yes, the graph is called k-colorable. This problem is known to be NP-hard. A brute force attempt to solve this problem is to consider all k^N possible colorings, where N is the number of vertices. This exponential blow-up could be more gracefully handled using a parallel algorithm, particularly a processor farm technique.
The graph's vertices is partitioned into P roughly equal sized subsets, each assigned to a processor. All possible colorings are checked on these subgraphs. Such colorings are called valid. Now subgraphs are merged in pairs, and their valid colorings are merged. That is, if p0 has 4 valid colorings and p1 has 6 valid colorings, then the merge of their colorings is 4*6 = 24 possible colorings to check. This list is divided between p0 and p1 and checked for validity. At the next step, another merge takes place and the possible colorings are divided between 4 processors. The algorithm proceeds in this way until no valid colorings remain or a valid coloring is found.
This algorithm is more efficient than the naive brute force method for 2 reasons. One, many colorings may be eliminated early before they are considered. For example, if p0 is given vertices v1 and v2, then p0 will only pass on valid colorings where v1 and v2 have different colors, thus eliminating all colorings over the vertices where v1 and v2 have the same color. Second, while the merged valid coloring list may grow exponentially, it is divided among processors to check for validity. Furthermore, graphs that experience a lot of blow up will tend to be sparse graphs which are obviously k-colorable.
Possible colorings are stored in a color list data structure. This is an integer matrix with number of rows equal to the number of colorings in the list and number of columns equal to the size of the vertex set the colorings are for. Entry (i,j) is the color assigned to vertex j in coloring i. Operations on color lists include:
- Check for validity: input a color list of possible colorings and output a color list of only those colorings that are valid.
- Divide into m equal parts: input a color list and split into m color lists of roughly equal size. This is used when dividing work (checking for validity) among slave processors.
- Merge colorings: once two color lists for different vertices are checked for validity, they are merged to form a list over both sets of vertices which contains all combinations of the original colorings. If color list c1 has a colorings and color list c2 has b colorings, then the merge of c1 and c2 has a*b colorings.
The graph is read from a text file g.txt, where the first 2 numbers are the number of vertices and the number of colors to use, respectively, and then the upper triangular adjacency matrix (see this example). Other details:
- For simplicity, the algorithm expects 2^k processors for integer k >= 1.
- Example of the graph text file for a cycle of length 19 using 3 colors.
- The maximum graph size is 20 vertices; this is defined in the .c file.
This program required much more complicated communication than assignment 1, and gave me experience debugging MPI programs. It was enlightening to write a program where work is dynamically divided among a varying number of slaves. It could be improved by adding some pre-processing to ensure many edges between vertices in the initial subgraphs given to each processor. Also, it should be noted that this program will examine "symmetric" or "permuted" colorings; schemes to eliminate these equivalent colorings could significantly reduce runtime. All MPI communication was done with MPI_Send and MPI_Receive, where more sophisticated calls would decrease communication and simplify the code.
- Assignment 3 [8 marks]: The previous assignment proved to be very challenging and ended up much longer than expected, in terms of lines of code. Also, I was interested to compare performance of multiple sends and receives against scatter, gather and broadcast. Therefore, for this assignment I replaced nearly all sends and receives with other MPI calls and used MPE to check the performance. Note that this change was non-trivial; many loops were removed and the method of packing/unpacking color lists was changed. I learned the dramatic difference in communication performance these sophisticated calls can exhibit over send and receive (see below). These changes allow for the code to be further simplified to cut the lines of code more. Experiments with MPI custom data structures might allow for easier transfer of color lists. Time permitting, these changes could be made.
Here is a2.slog2 and a3.slog2, logfiles generated from assignment 2 and 3, respectively. These were both generated using the same department computers (albani, carlsberg, raffo, moretti, all of which are 1 GHz Pentium III machines with 256 MB memory) on the 19-cycle input graph above. This test strongly suggests a3 gives better performance. This is likely because broadcast and scatter do not incur the overhead of sending p-1 individual messages, overhead associated with steps involved in a send (e.g. copying to/from the NIC). Also, by examining a2.slog2, it appears that sequential sends from one processor to two or more different processors have the property that one send does not begin until the previous send has been received. Broadcast and scatter, by examining a3.slog2, appear to hide this sequential network latency by sending the information in parallel.
- Assignment 4 [15 marks]: Modeling problem 6-18 on page 196 of text "Parallel Programming": write a parallel program to model fish and sharks. An ocean is modeled as a cube of cells where fish and sharks reside and evolve. I choose to do a 2D version so visualization is possible. This was a huge assignment that took much, much longer than expected. The first issue was deciding on a suitable communication model which was challenging. Many such models that have little communication do not properly model the problem. For example, assume we divide the square ocean into a perfect square number of small squares, each belonging to a different processor (in a grid topology). Schemes where each processor evolves its own fish/sharks and then performs communication to surrounding processors does not accurately model the evolution. Among other problems, there are situations where a shark should move across the boundary between the processors to eat a fish, but does not because it is unaware of the fish at that location. Furthermore, schemes where communication occurs after an entire evolution step can have conflicts where (say) two fish move to the same place. Resolving such conflicts is complicated; if we choose to give the conflicted cell to one fish and ask the other to back up, this might not be possible since its old cell may now be occupied.
The model I decided on uses a linear topology instead of a grid. This results in less communication given the protocol below, as less "boundary cells" exist. The ocean is divided into vertical strips of width at least 4 and assigned to processors. Call cells adjacent to a boundary between processors, and all cells adjacent to a cell adjacent to the boundary between processors "boundary cells". Evolution proceeds with a random fish/shark selected for each processor and evolved in some way (moves, dies, etc.) and the action is (possibly) communicated to neighboring processors. This information is ignored if the evolving fish/shark is not in a boundary cell. If it is in a boundary cell, some conflict resolution may be required. A detailed description of the conflict resolution protocol used is below. Once all boundary cell fish/sharks have evolved, no more communication is needed for the rest of the evolution epoch.
The Protocol:
Each processor is assigned a number of consecutive vertical strips of ocean, called R, for x in the range [i,j]. Partition R into 3 sets:
- LB: The 2 vertical strips at x = i and x = i+1
- B: The vertical strips of x in the range [i+2, j-2]
- RB: The 2 vertical strips at x = j-1 and x = j
Note that if R is the leftmost region, then LB is empty, if R is the rightmost region, then RB is empty, and B is empty if the width of R is 4, the minimum width. Call the adjacent region on the left's RB set NLB, and called the adjacent region on the right's LB set NRB. These sets are illustrated below:

The processor responsible for R maintains read-only copies of NLB and NRB so that fish and sharks in LB and RB can make informed decisions.
An epoch is a timestep where each fish and shark in the entire ocean is given a chance to evolve, which corresponds to the rules given on page 191. At the start of an epoch, each processor generates a random ordering on all of its fish and sharks by which they will evolve. In processor k (0 < k < p-1), where p is the number of processors), as longs as there are fish or sharks in RB (LB) that have not evolved, the result of each evolve step much be communicated to processor k-1 (k+1). This must be done since processor k-1 (k+1) must be aware of all movements in LB (RB), and communicating on every evolution is the only way to ensure this. If the movement did not take place in LB (RB), however, the result of the movement is not sent to processor k-1 (k+1); only a signal that nothing of interest took place. Once all fish and sharks that start the epoch in LB (RB) have evolved, we can stop communicating with processor k-1 (k+1).
Conflicts and Resolution:
Define a conflict to be when the evolution of two fish/sharks where has them move to the same cell during an epoch. This includes situations where two fish attempt to move to the same cell, two sharks attempt to eat the same fish, a fish moves to a cell that another fish just exited, but gave birth to a new fish while doing so. Note that in this model each cell may be occupied by at most one fish or shark. If no fish or sharks cross the boundary between processor regions during evolution, no conflicts can take place. Similarly, no conflict can occur when adjacent processors have their fish/sharks cross the boundary at the same time. Conflicts can only occur when adjacent processors have one fish/shark cross the boundary and another fish/shark move adjacent to the boundary. Examples of this are a shark moving across the boundary to eat a fish while the fish moves simultaneously away from the shark, or a fish moving across the boundary to occupy a cell where a fish near the boundary simultaneously just moved to. For simplicity, conflicts are resolved by assuming the fish/shark crossing the boundary acted first.
This conflict resolution scheme requires all fish/sharks evolving within RB or LB to evolve tentatively, that is, decide on a movement but check if a conflict will occur before committing that movement. For example, consider a case where processor k has a fish in LB moving to cell c while processor k-1 has a shark in NLB moving across the boundary to eat said fish. Here, processor k will send a signal to processor k-1 that it is planning a tentative movement (and a signal to processor k+1 that nothing of interest is going on, if fish/sharks remain in RB to evolve). It will then wait for a message from processor k-1; if it happened that both processors were planning tentative moves they would both proceed, since no conflict could occur. However, in this case processor k receives the message that the shark is coming to eat the fish before it moves, so processor k will update its ocean accordingly and acknowledge the eaten fish to processor k-1. Note that processor k may have to send up to 3 messages and receive up to 4 messages per evolution step.
Some miscellaneous implementation details:
- Fish and sharks may only move up, down, right or left and not diagonally.
- Both fish and sharks will give birth with each movement after they reach their breeding age.
- The struct representing fish and sharks is 5 integers for 20 total bytes. Space is allocated only for existing fish and sharks on a processor, and not for empty cells.
- Fish and sharks can be accessed through a dynamic pointer array which points to all living fish and sharks in a given processors ocean. This allows the algorithm to efficiently iterate though them in a random order.
- Fish and sharks can also be accessed through an n by n matrix of pointers so it can quickly be determined if neighboring cells to a given fish/shark is occupied.
- Model parameters are read from ocean.txt. The format is n (the ocean length on one side; ocean has n*n cells), number of fish, number of sharks, fish breeding age, shark breeding age, fish lifetime, shark lifetime (how long shark can live without eating). Then initial placement, in the format of
FISH: < ordered pairs >
SHARKS: < ordered pairs >
See this example.
- Some parameters are defined as constants in a4.c: evolution will continue for 50 epochs (EVOLVE_STEPS), the maximum size of n is 100 (NMAX), the number of seconds to wait between epochs is 2 (SLEEP_TIME).
- The visualization has fish occupied cells colored aqua, shark occupied cells colored red, empty cells colored blue, and boundaries between processors colored magenta.
This assignment was a huge programming task, exercising program design, memory management, and parallel debugging. Probably the most important thing I learned was the importance of a communication model that is correct and efficient. Many models were considered and rejected before designing the simplest one that correctly modeled the problem. This was a challenge to implement, with many bugs occurring from deadlock between communicating processors. I learned how to design a suitable communication model and topology for a problem, implement and debug it, as well as how to do visualization with MPE graphics.
- Assignment 5 [8 marks]: A study of comparing computation to communication tradeoffs. With communication happening at nearly every evolution step, does parallelism actually increase performance? Before running any experiments, I expected that there's 2 cases where parallelism would decrease execution time:
- A sparsely populated ocean. Here there are very few fish/sharks in the boundary strips, so little communication will take place.
- A densely populated ocean with a large number of processors. Here, the time to evolve p fish/sharks in parallel with communication is less than the time to evolve p fish/sharks sequentially on one processor with no communication.
Assignment 4 was modified to not display any graphics or sleep between epochs. The number of epochs (EVOLVE_STEPS) was changed to 1000. The file input format was changed to use a density of fish and sharks so that no initial placement is needed. The format is n (the ocean length on one side; ocean has n*n cells), fish density, shark density, fish breeding age, shark breeding age, fish lifetime, shark lifetime (how long shark can live without eating). Fish and shark density are floating point values corresponding to the percentage of the ocean starting with fish and sharks. These parameters are read from a file "ocean_density.txt".
MPE logging was added to log specific events. Logfiles generated ("thelog.slog") time spent for the initial broadcast of parameters and the scattering of fish in orange, time spent sending in yellow and time spent receiving in red. The entire timeline is colored blue, so any part of the log that is only blue is computation time.
Experiments were conducted using a constant ocean size (n = 50, for n*n = 2500 cells), a varying density of fish (0.1, 0.01, 0.001) and a varying number of processors (1, 2, 4). Parameters in ocean_density.txt were set so that no sharks exist and no fish ever die or give birth. While this is uninteresting as far as modeling, it is important to maintain a constant number of fish in the ocean for accurately measuring performance. Note that non-trivial communication still occurs with these parameters such as tentative movements. Experiments were run on similar department computers (albani, carlsberg, raffo, moretti, all of which are 1 GHz Pentium III machines with 256 MB memory). Results are given below in the form of the .slog2 file and the execution time.
| fish density\# processors |
1 |
2 |
4 |
| 10% |
0.0913589344 s |
2.8096750946 s |
4.4779934509 s |
| 1% |
0.0096779837 s |
0.0809389325 s |
0.7991848985 s |
| 0.1% |
0.0028769533 s |
0.0330070322 s |
0.1777976002 s |
These results clearly show that parallelism is not useful for performance with these parameters. It would likely take much larger values of n and p to show any performance gain. For a given fish density d, the number of fish grows in n as d*n*n, and the number of fish per processor grows as d*n*n/p. If p is some constant fraction of n, say n/8, then the number of fish per processor grows linearly with n. Therefore, for sufficiently large n, parallelism will pay off when c1*d*n*n/p < c2*d*n*n, where c1 is the time to evolve a single fish and communicate the results and c2 is the time to evolve a single fish with no communication. Obviously in the experiments above, c1 is likely to be 2 to 3 orders of magnitude larger than c2; faster communication would reduce this ratio and have communication pay off for more reasonable values of n. We can roughly estimate c1 and c2 given the single processor experiment above at 10% density (0.0913589344 seconds), and another experiment not shown above - 8 processors at 10% density which took 5.0239305695 seconds. An estimate of c1 is
c1 = [total time]/[# of evolution steps]
[# of evolution steps] = [number of epochs]*d*n*n/p = 100*(0.1)*50*50/8 = 3125
c1 = 5.0239305695/3125 = 0.00160765.
An estimate of c2 is
c2 = [total time]/[# of evolution steps]
[# of evolution steps] = [number of epochs]*d*n*n = 100*(0.1)*50*50 = 25000
c2 = 0.0913589344/25000 = 0.000003654.
This gives a required value of p of about 440 for parallelism to pay off. Note the simplifying assumptions in the c1 calculation that every evolve step requires communication (which does not hold unless the last evolved fish in an epoch is in a boundary strip) and that there is good load balancing across processors. Also, the c2 calculation ignores the overhead incurred by running the parallel program on one processor as opposed to writing a sequential program that would be slightly faster.
It should be noted that with interrupt driven communication we could design a much faster algorithm. That is, if we only communicate activity in a boundary strip as opposed to polling all activity until the boundary strip has evolved would result in far less communication overhead.
This analysis could be improved by testing with larger values of n and p that are not practical given time and hardware constraints. Also, testing with a lower latency network would reveal if the new number of processors needed for parallelism to pay off is practical.
- Assignment 6 [9 marks]: Experimentally determining the LogP parameters of a system. I choose this assignment because it would involve investigating the semantics of MPI sends, which I have been curious about from previous assignments. Also, this systems assignment is a change of pace from large programming assignments and involved more thinking and less debugging.
The LogP parameters are (as taken from lecture slides):
- L: Latency. Time spent on the network. During this time the processor can do other work.
- O: Overhead. Processor busy time on the sending or receiving side. During this time, the processor cannot do other work. We distinguish between "send" and "recv" overhead.
- G: Gap. The rate at which messages can be pushed into the network.
- P: The number of processors.
First some experiments were done to determine the if any gap existed. Using 2 processors, MPI_Send was called a large number of times (64,000) on one processor, the sender, and MPI_Recv was called an equal number of times on the other processor, the receiver. Two experiments were timed, one where the sender had no delay between sends and another where there was a small (a few milliseconds) delay between sends. The timing of these experiments showed significant discrepancy, so we conclude that the gap should be measured.
A P value of 4 was fixed before working to find G, O and L. This was used to keep the experimental program(s) simple, but allow for some variation of communication with 2 senders and 2 receivers. Documentation of MPI_Send explains that it blocks in the sense that it copies the given buffer before returning (so the buffer could be re-used), but nonblocking in the sense that it does not wait for the message to be sent. Thus, the time taken to return from MPI_Send is the send overhead plus any applicable gap. MPI_Ssend, however, does not return until receiving has begun. Though the specific actions taken during this kind of send depends on middleware, it is reasonable to assume that it will take twice the latency plus the send overhead time until returning. The first experiment will use MPI_Send with no waiting between calls to determine gap time. The second experiment will use MPI_Send with a small (100 µs) delay between calls to determine send and receive overhead. The third experiment will time MPI_Ssend calls to find the network latency.
Due to the relatively coarse timing in C (time.h), we will need to time many sends and average to find the time of a single send. Since we are trying to isolate only the time taken of MPI_Send (and MPI_Recv), we aim to minimize loop overhead and L1 cache misses. Loop overhead is reduced by unrolling the send loop 8 times; cache misses are addressed below. On 4 processors, the 2 senders exchange the processors they are sending to each loop iteration to introduce some variation. In experiments 1 and 2 we can hide the initial unknown network latency by beginning the sending loop 1 second before starting the receiving loop. In these experiments we use time.h function time() to measure processor time. We cannot use this in experiment 3 since MPI appears to block the process when a send call is waiting for the matching receive call, or vise versa. Instead we use MPI_Wtime(), which returns accurate wall clock time synchronized across all processors. On the department machines used, this time value ticks every 1 µs.
None of the LogP parameters describe possible overhead and latency variations on different sized messages. To examine this variation, the above experiments were run twice: once with messages of 0 size bytes and once with messages of size 1 kilobyte. With the kilobyte data initialized just before the loop and the same buffer input to each MPI_Send, there should be no L1 cache misses.
Output of the 3 experiments is here; we use this to estimate the parameters.
Experiment 1: Here we ignore Wtime and use CPU time. First consider the 0 byte message case: if we average the recorded time of the senders (p0 and p1) we get an average of 6.95 µs. Similarly, averaging the recorded time of the receivers (p2 and p3), we get 7.11 µs. The 1 kilobyte message case gives average send and receive time of 12.67 µs and 15.00 µs, respectively.
Experiment 2: Here we also ignore Wtime and use CPU time. First consider the 0 byte message case: if we average the recorded time of the senders (p0 and p1) we get a send overhead of 1.09 µs. Similarly, averaging the recorded time of the receivers (p2 and p3), we get a receive overhead of 1.02 µs. The 1 kilobyte message case gives send and receive overheads of 2.03 µs and 2.11 µs, respectively. With these overhead, we can subtract the send overhead from the computed times in experiment 1 to give the gap values (see below).
Experiment 3: Here we use Wtime to measure network latency. The average time for synchronous send in the 0 byte case is 290.66 µs. Assuming this time is twice the network latency plus the send overhead, we compute a latency of 144.82 µs. We also compute latency for the 1 kilobyte case (see below). Note that a ping between these computers takes about 230 µs.
The experiment results are summarized here:
| message size\parameter |
send overhead |
receive overhead |
gap |
latency |
| 0 bytes |
1.09 µs |
1.02 µs |
5.86 µs |
144.82 µs |
| 1 KB |
2.03 µs |
2.11 µs |
10.64 µs |
276.97 µs |
It is interesting to note that all parameters increase by roughly a factor of two when increasing the message size from 0 bytes to 1 kilobyte.
This experiment could be improved or extended in many ways. The most straightforward is plotting the computed parameters above against message size for some values larger than 1 KB. The difficulty here is avoiding L1 and L2 cache misses on computers with a data cache of less than 16 KB. To verify the above computation, the assumptions made on semantics of MPI_Send, MPI_Ssend and MPI_Recv would have to be investigated. However, the above programs and analysis give good approximations with these semantics unknown. It would be interesting to try these experiments on another system where the LogP parameters are known in advance and see how accurately they are computed. One way to do this is to compare these experimental results verses those produced by MPI perf.
- Assignment 7 [12 marks]: A theoretical comparison of sorting networks in terms of speed and power consumption. Suggested by Mark Greenstreet. See the .pdf here.
- Assignment 8 [10 marks]: A report on Google's use of parallelism. See the .pdf here.
- Assignment 9 [8 marks]: Deciding if a 1D cellular automaton is reversible in O(n) time. See the .pdf here. The program takes as a command line parameter a filename for a file containing space delimited CA rules on the first line. See the 3rd page of the .pdf for some reversible examples.
Last Updated: April 11th, 10:45 AM, 2006