Projects!
The following are a set of incubatory project ideas that I
would be interested in supervising Masters or PhD-level
research projects on. These aren't the only things that I'm
interested in supervising, but if you are having trouble
thinking up an interesting thing on your own, they may be of
interest. Note that some of the things here are related to
existing projects in the lab, and some are ideas that I've
discussed with other faculty -- I've tried to identify both of
these classifications wherever possible.
- SecondSite/Remus: Highly available and
disaster tolerant systems.
- Hindsight: Logging execution to a big
database and analysing it to understand what systems have done.
- Exceptions and Black Boxes: What do you
do when you know something is about to go wrong with a system that you
can't change?
- Parallelism isn't enough: And why it's
probably more likely to cure cancer than to fix your OS.
If any of these seem interesting to you, feel free to drop me an email.
Remus, SecondSite, High-Availability, Disaster Recovery
The Remus project built extensions to Xen that continuously
replicate a running virtual machine onto a second physical computer,
allowing you to pull the plug on one machine and have your VM
seamlessly continue execution on the backup. This was Brendan
Cully's Master's project, and it is an example of what I consider to
be an excellent Master's thesis: it's a complete thing, it's being
released as open source, and it covers a range of topics in systems,
including virtual memory, networking, and storage.
Brendan's thesis
is well worth reading if you are looking for an example
Master's thesis.
Ryan O'Connor is an undergraduate research assistant in the lab
who has been working on making Remus work over a long-distance link
between Vancouver and Kamloops. If you are interested in this class
of highly available system, Brendan and Ryan are both great people to
chat to for ideas and to find out about the code.
There are a bunch of interesting things to be done on Remus:
Disaster Tolerant Computing. Ryan's current work on Remus
could quite easily be turned into a number of Master's-thesis-level
projects. We're trying to make Remus run well when the pair of physical
machines is in geographically isolated locations, hundreds of
kilometers apart. We have built a testbed of two machines, one in
Vancouver and one at TRU in Kamloops, and are able to fail over VMs
between them. However, there are a number of challenges to solve
here:
- Compression: Long-distance links are very expensive,
and so it's important to reduce the amount of replication traffic that
is sent between them. A project here would explore how we can
compress pages being sent, identify pages that don't need to be sent
at all, and amortize the cost of replication across a number of VMs
(a small sketch of the hash-and-compress idea follows this list).
If you are interested in this area, you may want to look at things
like the TRAP-Array paper from ISCA, LBFS, and the recent UCSD
Difference Engine work at OSDI 2008.
- Traffic redirection: In the original Remus work,
failover was between two hosts on the same subnet. The only
network-layer work required to fail over is to send an unsolicited ARP
advertisement to notify the upstream switch that the VM has moved. In
the wide-area case, we must do a great deal more work: the IP address
must move, and BGP updates must be sent so that Internet traffic is
redirected appropriately. The challenge here is to understand how
quickly updates can be made and propagated, and how well this works
with various BGP and dual-homing configurations.
- Failure Detection: Failure detection in the original
Remus used a pair of redundant network connections. This is not
practical in the wide area, given that long-distance connections are
very expensive, and ensuring that they are physically isolated (from
backhoes, for instance) is quite hard. We need to develop a
distributed, Internet-based failure detection system that allows
external hosts to help the two protected sites decide when to
fail over (the second sketch after this list shows one way to think
about it). There are some challenging problems in this space around
heartbeating and distributed failure detection.
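To make the compression idea a little more concrete, here is a minimal Python
sketch of an LBFS-ish hash-and-compress pass over a checkpoint's dirty pages:
hash each page, send only a short reference if the backup has already seen
identical content, and compress whatever does get sent. None of this is real
Remus code; send_dirty_pages and the transport callback are made up for
illustration.

    import hashlib
    import zlib

    PAGE_SIZE = 4096

    class ReplicationChannel:
        """Illustrative sender-side cache of page hashes the backup already holds."""

        def __init__(self, send_fn):
            self.seen = set()          # hashes of pages the backup is known to have
            self.send = send_fn        # hypothetical transport: send(kind, payload)

        def send_dirty_pages(self, dirty_pages):
            """dirty_pages: iterable of (page_frame_number, bytes) for this epoch."""
            for pfn, data in dirty_pages:
                digest = hashlib.sha1(data).digest()
                if digest in self.seen:
                    # Backup already has identical content; send only a reference.
                    self.send("ref", pfn.to_bytes(8, "little") + digest)
                else:
                    self.seen.add(digest)
                    self.send("page", pfn.to_bytes(8, "little") + zlib.compress(data))

The same structure extends fairly naturally to amortizing across VMs: the set of
seen hashes just becomes shared between the replication channels of all the VMs
on a host.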
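On the failure-detection side, one way to frame the problem is to let a handful
of witness hosts scattered around the Internet vote on whether a protected site
has actually disappeared, so that the backup only promotes itself when a
majority agrees. The toy sketch below shows the shape of that decision; the
witness hosts, the wire protocol, and the function names are all invented.

    import socket

    # Hypothetical witness hosts spread around the Internet.
    WITNESSES = [("witness1.example.org", 7000),
                 ("witness2.example.org", 7000),
                 ("witness3.example.org", 7000)]

    def peer_reported_down(witness, peer_id, timeout=2.0):
        """Ask one witness whether it has heard a recent heartbeat from the peer site."""
        try:
            with socket.create_connection(witness, timeout=timeout) as s:
                s.sendall(b"ALIVE? " + peer_id + b"\n")
                return s.recv(64).strip() == b"NO"
        except OSError:
            return False   # an unreachable witness never counts as evidence of failure

    def should_failover(peer_id=b"site-vancouver"):
        """Promote the backup only if a majority of witnesses agree the peer is gone."""
        votes = sum(peer_reported_down(w, peer_id) for w in WITNESSES)
        return votes > len(WITNESSES) // 2

The interesting research questions start exactly where this sketch stops: how
witnesses gather heartbeats, how you avoid split-brain when the two sites
disagree, and how long all of this takes in practice.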
Better Mousetraps. Remus's big benefit is that it doesn't
require that applications or OSes be changed in order to support high
availability. Its biggest weakness is that the transparency
currently comes at a high cost: we copy lots of data, and incur a
large overhead for output commit to prevent people from seeing
speculative state. Christopher Head, an undergrad research assistant
in the lab, has been looking into ways to fix these issues:
- Exposing Remus to applications: One option is to let
applications know that they are running on top of a replication layer,
and let them elect not to protect certain pages in memory (a
hypothetical hint interface is sketched after this list). This has
the potential to be very useful for databases (which manage their own
memory and generally use it for caching) and may also be useful for
garbage-collected runtimes.
- Reducing network delay: Remus doesn't send any
traffic out to the network until it knows that the machine state
responsible for that traffic has been checkpointed. This introduces
delays of between 20 and 200ms on RTTs and really messes with
latency-sensitive things like talking to network-attached storage.
One way to fix this might be to build some protocol-specific state
trackers, and replicate undo logs between hosts at a higher frequency
than the rest of the system.
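To give a flavour of what exposing Remus to applications could look like, here
is a hypothetical hint interface: the application marks address ranges (say, a
database buffer cache) as not worth replicating, and the checkpoint path
filters those pages out of each epoch. The remus_unprotect call and everything
around it is invented for illustration; nothing like it exists in the current
code.

    import mmap

    PAGE_SIZE = mmap.PAGESIZE

    # Hypothetical registry of address ranges the guest has asked us not to
    # replicate, e.g. a buffer cache that can be rebuilt from disk after failover.
    _unprotected = []

    def remus_unprotect(addr, length):
        """Hypothetical hint call: exclude [addr, addr+length) from checkpoint traffic."""
        start = addr - (addr % PAGE_SIZE)
        _unprotected.append((start, addr + length))

    def pages_to_replicate(dirty_pages):
        """Filter an epoch's dirty pages, dropping any that fall in unprotected ranges."""
        for pfn, data in dirty_pages:
            addr = pfn * PAGE_SIZE
            if any(start <= addr < end for start, end in _unprotected):
                continue
            yield pfn, data

The hard part, of course, is not the filter but deciding what the contract with
the application should be: what state it promises it can reconstruct after a
failover, and how that promise is checked.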
Byzantine Remus. Well, okay, not Byzantine Remus -- who
would be crazy enough to do that -- but how about non-fail-stop HA?
The current system assumes that failure is absolute, and that it
comes from hardware. If you crash or have software problems, you are just
going to replicate those problems onto the other system and get
absolutely no help. It would be very interesting to think about
whether Remus can be used to help recover from software errors, by
rewinding and replaying systems that crash. People interested in this
area should look at the SOSP Rx paper, OSDI Failure-Oblivious
Computing, and maybe also Triage.
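As a very rough illustration of the rewind-and-replay direction (in the spirit
of Rx), the loop below rolls a failed VM back to its last checkpoint and
retries with the environment perturbed slightly, hoping the bug was timing- or
input-dependent. Every call on the vm object here is a placeholder, not an
existing interface.

    import random
    import time

    def recover(vm, max_retries=3):
        """Toy rewind-and-replay loop: restore the last good checkpoint and retry
        with a small perturbation, hoping a nondeterministic bug goes away."""
        for attempt in range(max_retries):
            checkpoint = vm.last_checkpoint()      # placeholder: most recent epoch
            vm.restore(checkpoint)                 # placeholder: roll state back
            vm.perturb(seed=random.random())       # placeholder: e.g. reorder message
                                                   # delivery, pad allocations
            vm.resume()
            if vm.wait_healthy(timeout=30):        # placeholder health check
                return True
            time.sleep(1)
        return False                               # give up and fail over for real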
This is work that I am currently doing with Brendan Cully, Geoffrey
Lefebvre, Dutch Meyer, Chris Head, Ryan O'Connor, Norm Hutchinson, and
Mike Feeley.
HindSight: Understanding what software is doing
We don't really know what applications are doing. We know what they
are supposed to do, and we have their source, which is a vague
codification of what they are supposed to do. Unfortunately, when
applications crash or are attacked, they generally don't do what they
are supposed to, and we are left staring into a smoking hole trying to
figure out what went wrong.
What if you had a detailed recording of everything
your applications and OS had done for weeks or months of execution?
What if you could issue queries against this trace data to understand
execution? What if these traces were so detailed that you could use
them to reconstruct a running version of the system at any point in
time, and resume execution in order to interact with it and ask
questions? How would you use this data to improve software
understanding, debugging, security, etcetera?
We have built an initial version of this system and are starting to
build tools on top of it. Areas of particular interest include:
- Helping developers understand and debug code.
Looking at source, we can use logging to identify code paths that
really ran in practice and ones that are dead. Debugging as a
query-driven process allows you to ask questions like: "What things
started here, and wound up over there?" (an example query sketch
follows this list).
- Building better test and exploit tools, because we
can regenerate system state and force it down different paths of
execution.
- Data mining for bugs. Performing control and data
flow analysis, I'm interested in discovering whether we can do outlier
detection to infer things like calls that should have had a lock held,
but didn't.
- Visualizing execution. These systems are incredibly
complex and contain millions of lines of code. There is an absolutely
enormous amount of data in these traces, and one of the biggest
challenges is how to present it in an understandable way to humans.
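As an example of what a query-driven debugging session over these traces might
look like, here is a sketch against an invented relational schema in which
every logged control-flow event is a row. The "started here, wound up over
there" question becomes a self-join; the table layout is an assumption for
illustration, not Hindsight's actual trace format.

    import sqlite3

    # Invented schema: events(thread_id, pc, timestamp), one row per logged
    # control-flow event.  The real trace format will differ.
    QUERY = """
    SELECT a.thread_id, a.timestamp AS entered_here, b.timestamp AS reached_there
    FROM events AS a
    JOIN events AS b
      ON a.thread_id = b.thread_id
    WHERE a.pc = :here          -- e.g. address of the code that "started here"
      AND b.pc = :there         -- e.g. address of the code it "wound up" in
      AND b.timestamp > a.timestamp
    ORDER BY a.timestamp;
    """

    def paths_from_to(db_path, here, there):
        """Which threads passed through `here` and later reached `there`?"""
        with sqlite3.connect(db_path) as db:
            return db.execute(QUERY, {"here": here, "there": there}).fetchall()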
This is work that I am currently doing with Geoffrey Lefebvre,
Dutch Meyer, Brendan Cully, Norm Hutchinson, and Mike Feeley.
Exceptions and Black Boxes
Virtual machines are really good at isolating software. This is why
there is a lot of fuss
about virtual
appliances as a means of distributing software. Because VMs let
you put software in a box, there are an increasing number of tools
that let you do security-related things to protect the box -- running
firewalls, intrusion detection, introspection, etcetera all from
outside a VM where the cooties that you download from the internet
can't interfere.
The problem with all of these things is that it is not at all clear
what you should do when you realize that something has gone wrong:
when you are looking at software from the outside, and you realize
that it is just about to run an infected binary, or a buffer overflow,
or some other variety of malaise, what can you do?
Existing systems use rather large hammers, like stopping the
machine, which turns bugs that might be used as a basis for an attack
into a guaranteed denial of service. The question is whether there are
techniques that might be developed to shepherd execution within the VM
away from problems, and allow execution to continue in a useful
way.
This project idea is with Bill Aiello.
Parallelism isn't enough.
Everyone knows that CPUs aren't getting faster as fast as they used
to. Instead of having moore cycles per core, we are getting moore
cores per socket. This is great for a few workloads: in particular,
massively parallelizable problems like string search, which is great
for the bioinformatics people, and media processing, which is good for
things like Photoshop and HD video. Unfortunately, most applications
aren't going to get any faster. Your OS is still going to take
minutes to boot, Firefox will still take forever to start, and it's
still going to take a while to find that document you lost.
This is because IO, not computation, is the real bottleneck in
common IT workloads, and parallelism, in its current form, really
isn't going to help.
So the first project here is to validate these claims. I'd like
someone to simulate an infinitely fast processor and tell me how much
faster it would make modern applications. How fast would Windows boot
if it was only doing IO? How much would SpecWeb scores
improve? This is a good candidate for a HotOS paper, and if you do it
really fast it can be submitted to HotOS for the January deadline.
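While waiting for a proper simulation, a crude Amdahl-style bound gives a feel
for the answer: if computation were free, a run would still take at least the
time it spent waiting on IO. The numbers below are placeholders to show the
arithmetic, not measurements of anything.

    def infinite_cpu_speedup(total_seconds, cpu_seconds):
        """Upper bound on speedup from an infinitely fast CPU: the run still takes
        at least the time it spent blocked on IO and other non-CPU waits."""
        io_seconds = total_seconds - cpu_seconds
        return total_seconds / io_seconds

    # Placeholder numbers: a 60-second boot that spends 18s on the CPU and 42s
    # waiting on disk speeds up by at most ~1.4x, no matter how many cores you add.
    print(infinite_cpu_speedup(total_seconds=60.0, cpu_seconds=18.0))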
The broader project here is to try to solve the problem. I believe
the solution is probably around trying to make better use of memory
and IO resources in order to very aggressively speculate, at a
whole-system level, ahead of the current state of the system. So the
question here is whether you can come up with useful ways to structure
an OS and applications that take existing code, and allow it to take
advantage of parallelism to go faster, at a system level.
An initial idea here is to use data dependency-based continuations
as the basis for representing
systems. Capriccio
did this in a way. At a high level, you arrange to structure the
system as a graph of blocking I/O states, and use that graph structure
to do things like speculative execution and prefetching.
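A toy version of that structure, just to show the shape of the idea: nodes are
blocking I/O states, and whenever execution enters a node we speculatively kick
off the I/O of its possible successors so the data is ready if control gets
there. This is purely illustrative; it is not Capriccio and not an existing
system.

    from concurrent.futures import ThreadPoolExecutor

    class IONode:
        """A blocking I/O state: one blocking operation plus the states that may follow it."""
        def __init__(self, name, io_fn):
            self.name = name
            self.io_fn = io_fn        # the blocking operation, e.g. a disk read
            self.successors = []      # IONodes that can run after this one

    def run_with_prefetch(start):
        """Walk the graph; on entering a node, speculatively start its successors'
        I/O so the data is (hopefully) already there when control reaches them."""
        pending, result = {}, None
        with ThreadPoolExecutor(max_workers=4) as pool:
            node = start
            while node is not None:
                for nxt in node.successors:                      # prefetch ahead
                    pending.setdefault(nxt.name, pool.submit(nxt.io_fn))
                future = pending.pop(node.name, None)
                result = future.result() if future else node.io_fn()
                # ...consume `result`, then pick the next state (toy: first successor)
                node = node.successors[0] if node.successors else None
        return result

The real research problem is everything this sketch punts on: discovering the
graph from existing code, deciding how far ahead it is safe and profitable to
speculate, and containing the side effects of work you end up throwing away.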