Projects!

The following are a set of incubatory project ideas that I would be interested in supervising Masters or PhD-level research projects on. These aren't the only things that I'm interested in supervising, but if you are having trouble thinking up an interesting thing on your own, they may be of interest. Note that some of the things here are related to existing projects in the lab, and some are ideas that I've discussed with other faculty -- I've tried to identify both of these classifications wherever possible.

If any of these seem interesting to you, feel free to drop me an email.

Remus, SecondSite, High-Availability, Disaster Recovery

The Remus project built extensions to Xen that continuously replicate a running virtual machine onto a second physical computer, allowing you to pull the plug on one machine and have your VM seamlessly continue execution on the backup. This was Brendan Cully's master's project, and it is an example of what I consider to be an excellent master's thesis: it's a complete thing, it's being released as open source, and it covers a range of topics in systems, including virtual memory, networking, and storage. Brendan's thesis is well worth reading if you are looking for an example master's thesis.
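
To make the mechanism a little more concrete, here is a heavily simplified sketch of the epoch-based replication loop that Remus-style systems use. This is purely illustrative: the class and function names are mine, and a real implementation checkpoints a Xen domain's memory, disk, and device state rather than a Python dictionary.

    import copy

    class PrimaryVM:
        def __init__(self):
            self.pages = {}          # page number -> contents
            self.dirty = set()       # pages written since the last epoch
            self.pending_output = [] # network output held for output commit

        def write_page(self, n, data):
            self.pages[n] = data
            self.dirty.add(n)

        def send_packet(self, pkt):
            # Output is buffered, not released, until the epoch that
            # produced it has been acknowledged by the backup.
            self.pending_output.append(pkt)

    class BackupVM:
        def __init__(self):
            self.pages = {}

        def apply_checkpoint(self, delta):
            self.pages.update(delta)
            return True              # acknowledge the checkpoint

    def run_epoch(primary, backup, wire):
        # 1. Briefly pause the VM and grab the pages dirtied this epoch.
        delta = {n: copy.deepcopy(primary.pages[n]) for n in primary.dirty}
        primary.dirty.clear()
        # 2. Resume the VM immediately; it now runs speculatively, ahead
        #    of the state the backup has.
        # 3. Ship the delta to the backup and wait for its acknowledgement.
        acked = backup.apply_checkpoint(delta)
        # 4. Only once the backup acks do we release buffered output, so
        #    the outside world never sees state the backup doesn't have.
        if acked:
            wire.extend(primary.pending_output)
            primary.pending_output.clear()

    if __name__ == "__main__":
        primary, backup, wire = PrimaryVM(), BackupVM(), []
        primary.write_page(0, "hello")
        primary.send_packet("SYN")
        run_epoch(primary, backup, wire)
        # If the primary dies now, the backup resumes from its last
        # checkpoint, which is consistent with everything on the wire.
        print(backup.pages, wire)

The output-commit step is the important trick: the primary is allowed to run ahead of the backup, but nothing it says to the outside world is released until the state that produced it has been safely replicated.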

Ryan O'Connor is an undergraduate research assistant in the lab who has been working on making Remus work over a long-distance link between Vancouver and Kamloops. If you are interested in this class of highly available system, Brendan and Ryan are both great people to chat to for ideas and to find out about the code.

There are a bunch of interesting things to be done on Remus:

Disaster Tolerant Computing. Ryan's current work on Remus could quite easily be turned into a number of master's-thesis-level projects. We're trying to make Remus run well when the pair of physical machines are in geographically separated locations, hundreds of kilometers apart. We have built a testbed of two machines, one in Vancouver and one at TRU in Kamloops, and are able to fail over VMs between them. However, there are a number of challenges to solve here.

Better Mousetraps. Remus's big benefit is that it doesn't require that applications or OSes be changed in order to support high availability. Its biggest weakness is that this transparency currently comes at a high cost: we copy lots of data, and incur a large overhead for output commit to prevent people from seeing speculative state. Christopher Head, an undergrad research assistant in the lab, has been looking into ways to fix these issues.
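
Purely to put rough numbers on those costs, here is a back-of-envelope calculation. The epoch length and dirty-page rate are numbers I've invented for illustration, not measurements from Remus or from Chris's work.

    # Back-of-envelope cost of epoch replication (illustrative numbers).
    epoch = 0.025                 # checkpoint every 25 ms
    dirty_pages_per_epoch = 5000  # assumed write rate of the workload
    page_size = 4096              # bytes

    bandwidth = dirty_pages_per_epoch * page_size / epoch
    print("replication bandwidth: %.0f MB/s" % (bandwidth / 1e6))  # ~819 MB/s

    # Output commit: a packet produced early in an epoch waits for the
    # rest of the epoch plus the checkpoint transfer and ack, so the
    # added latency is on the order of one epoch.
    print("worst-case added latency: ~%.0f ms" % (epoch * 1000))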

Byzantine Remus. Well, okay, not Byzantine Remus -- who would be crazy enough to do that -- but how about non-fail-stop HA? The current system assumes that failure is absolute and that it comes from hardware. If you crash or have software problems, you are just going to replicate those problems onto the other system, and replication will be of absolutely no help. It would be very interesting to think about whether Remus can be used to help recover from software errors, by rewinding and replaying systems that crash. People interested in this area should look at the SOSP Rx paper, OSDI Failure-Oblivious Computing, and maybe also Triage. This is work that I am currently doing with Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Chris Head, Ryan O'Connor, Norm Hutchinson, and Mike Feeley.
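
As a strawman of what "recover from software errors by rewinding and replaying" might look like (this is my sketch of the Rx-style idea, not anything we have built, and the names are made up): take periodic checkpoints, and when the software crashes, roll back to the last checkpoint and re-execute under a perturbed environment in the hope that the bug doesn't manifest the second time around.

    # Toy sketch of Rx-style rewind-and-retry: checkpoint, and on a crash
    # roll back and re-execute with the environment perturbed.  Purely
    # illustrative; a real system would checkpoint a whole VM.

    def run_with_retry(step, state, max_retries=3):
        checkpoint = dict(state)          # last known-good state
        for attempt in range(max_retries + 1):
            try:
                step(state, perturbation=attempt)
                return state              # success: a new checkpoint could be taken here
            except Exception as err:
                print("crash (%s), rolling back and retrying" % err)
                state.clear()
                state.update(checkpoint)  # rewind to the checkpoint
        raise RuntimeError("could not recover by replaying")

    def flaky_step(state, perturbation):
        # A bug that only manifests in the original environment; the
        # perturbed replay (extra heap padding, different scheduling, a
        # dropped request, ...) happens to avoid it -- the bet Rx makes.
        if perturbation == 0:
            raise MemoryError("heap corruption")
        state["progress"] = state.get("progress", 0) + 1

    if __name__ == "__main__":
        print(run_with_retry(flaky_step, {}))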

HindSight: Understanding what software is doing

We don't really know what applications are doing. We know what they are supposed to do, and we have their source, which is a vague codification of what they are supposed to do. Unfortunately, when applications crash or are attacked, they generally don't do what they are supposed to, and we are left staring into a smoking hole trying to figure out what went wrong.

What if you had a detailed recording of everything your applications and OS had done for weeks or months of execution? What if you could issue queries against this trace data to understand execution? What if these traces were so detailed that you could use them to reconstruct a running version of the system at any point in time, and resume execution in order to interact with it and ask questions? How would you use this data to improve software understanding, debugging, security, etcetera?
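
One common way to build this kind of facility -- and this is just my illustrative sketch of the general record-and-replay idea, not HindSight's actual interfaces -- is to log the non-deterministic inputs to a computation and rebuild its state at any point in time by re-executing against a prefix of the log, then answer questions by querying either the reconstructed state or the log itself.

    def record(inputs, step):
        log, state = [], {}
        for event in inputs:
            log.append(event)          # the only thing we have to store
            step(state, event)
        return log

    def replay(log, step, upto=None):
        # Reconstruct the state of the system as it was after `upto` events.
        state = {}
        for event in log[:upto]:
            step(state, event)
        return state

    def step(state, event):            # a deterministic "system"
        key, value = event
        state[key] = state.get(key, 0) + value

    if __name__ == "__main__":
        log = record([("a", 1), ("b", 2), ("a", 3)], step)
        print(replay(log, step, upto=2))   # state as it was at time 2
        # Queries over the trace itself, e.g. "when was 'a' last written?"
        print(max(i for i, (k, _) in enumerate(log) if k == "a"))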

We have built an initial version of this system and are starting to build tools on top of it; there are several areas of particular interest here. This is work that I am currently doing with Geoffrey Lefebvre, Dutch Meyer, Brendan Cully, Norm Hutchinson, and Mike Feeley.

Exceptions and Black Boxes

Virtual machines are really good at isolating software. This is why there is a lot of fuss about virtual appliances as a means of distributing software. Because VMs let you put software in a box, there are an increasing number of tools that let you do security-related things to protect the box -- running firewalls, intrusion detection, introspection, etcetera all from outside a VM where the cooties that you download from the internet can't interfere.

The problem with all of these things is that it is not at all clear what you should do when you realize that something has gone wrong: when you are looking at software from the outside, and you realize that it is just about to run an infected binary, trigger a buffer overflow, or suffer some other variety of malaise, what can you do?

Existing systems use rather large hammers, like stopping the machine, turning bugs that might be used as a basis for attack into a guaranteed denial of service. The question is whether there are techniques that might be developed to shepherd execution within the VM away from problems, and allow execution to continue in a useful way.

This project idea is with Bill Aiello.

Parallelism isn't enough.

Everyone knows that CPUs aren't getting faster as fast as they used to. Instead of having moore cycles per core, we are getting moore cores per socket. This is great for a few workloads: in particular, massively parallelizable problems like string search, which is great for the bioinformatics people, and media processing, which is good for things like Photoshop and HD video. Unfortunately, most applications aren't going to get any faster. Your OS is still going to take minutes to boot, Firefox will still take forever to start, and it's still going to take a while to find that document you lost.

This is because IO, not computation, is the big bottleneck in common IT workloads, and parallelism, in its current form, really isn't going to help.

So the first project here is to validate these claims. I'd like someone to simulate an infinitely fast processor and tell me how much faster it would make modern applications. How fast would Windows boot if it was only doing IO? How much would SpecWeb scores improve? This is a good candidate for a HotOS paper, and if you do it really fast it can be submitted to HotOS for the January deadline.
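
One very crude way to frame the measurement -- this is just an Amdahl-style upper bound, and the time breakdown below is invented for illustration, not a result -- is to note that with an infinitely fast CPU only the time spent waiting on IO remains, so the best possible speedup is total time divided by IO time.

    # Crude upper bound on what an infinitely fast CPU buys you: all the
    # compute time vanishes and only IO (and other waiting) remains.
    # The breakdown below is made up for illustration, not a measurement.

    def infinite_cpu_speedup(cpu_seconds, io_seconds):
        total = cpu_seconds + io_seconds
        return total / io_seconds      # Amdahl-style bound: compute -> 0

    # e.g. a hypothetical 60 s boot that spends 45 s waiting on the disk
    # can get at most ~1.33x faster, no matter how many cores you add:
    print(infinite_cpu_speedup(cpu_seconds=15.0, io_seconds=45.0))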

The broader project here is to try to solve the problem. I believe the solution probably lies in making better use of memory and IO resources in order to speculate very aggressively, at a whole-system level, ahead of the current state of the system. So the question here is whether you can come up with useful ways to structure an OS and applications so that existing code can take advantage of parallelism to go faster, at a system level.

An initial idea here is to use data dependency-based continuations as the basis for representing systems. Capriccio did this in a way. At a high level, you arrange to structure the system as a graph of blocking I/O states, and use that graph structure to do things like speculative execution and prefetching.
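
To make the "graph of blocking I/O states" idea a bit more concrete, here is a toy sketch -- my own strawman, not Capriccio's design or anything we have built. Each node is an I/O operation with explicit data dependencies, and the scheduler is free to issue, in parallel or ahead of time, any operation whose dependencies are already satisfied.

    from concurrent.futures import ThreadPoolExecutor

    class IONode:
        def __init__(self, name, action, deps=()):
            self.name, self.action, self.deps = name, action, list(deps)

    def run_graph(nodes):
        done = {}
        with ThreadPoolExecutor() as pool:
            remaining = list(nodes)
            while remaining:
                # Every node whose dependencies are satisfied can be issued
                # now -- this is where the prefetching/parallelism win comes from.
                ready = [n for n in remaining if all(d in done for d in n.deps)]
                if not ready:
                    raise RuntimeError("cycle in dependency graph")
                futures = {pool.submit(n.action, *[done[d] for d in n.deps]): n
                           for n in ready}
                for fut, node in futures.items():
                    done[node.name] = fut.result()
                remaining = [n for n in remaining if n not in ready]
        return done

    if __name__ == "__main__":
        # "config" and "index" don't depend on each other, so they are
        # issued together; "doc" blocks until both have completed.
        graph = [
            IONode("config", lambda: "config bytes"),
            IONode("index", lambda: "index bytes"),
            IONode("doc", lambda c, i: "doc found via %s + %s" % (c, i),
                   deps=("config", "index")),
        ]
        print(run_graph(graph)["doc"])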