Projects!
The following are a set of incubatory project ideas that I
would be interested in supervising Masters or PhD-level
research projects on. These aren't the only things that I'm
interested in supervising, but if you are having trouble
thinking up an interesting thing on your own, they may be of
interest. Note that some of the things here are related to
existing projects in the lab, and some are ideas that I've
discussed with other faculty -- I've tried to identify both of
these classifications wherever possible.
- SecondSite/Remus: Highly available and
disaster tolerant systems.
- Hindsight: Logging execution to a big
database and analysing it to understand what systems have done.
- Exceptions and Black Boxes: What do you
do when you know something is about to go wrong with a system that you
can't change?
- Parallelism isn't enough: And why it's
probably more likely to cure cancer than to fix your OS.
If any of these seem interesting to you, feel free to drop me an email.
Remus, SecondSite, High-Availability, Disaster Recovery
The Remus project built extensions to Xen that continuously
replicate a running virtual machine onto a second physical computer,
allowing you to pull the plug on one machine and have your VM
seamlessly continue execution on the backup. This was Brendan
Cully's Master's project, and it is an example of what I consider to
be an excellent Master's thesis: it's a complete thing, it's being
released as open source, and it covers a range of topics in systems,
including virtual memory, networking, and storage.
Brendan's thesis
is well worth reading if you are looking for an example
Master's thesis.
Ryan O'Connor is an undergraduate research assistant in the lab
who has been working on making Remus work over a long-distance link
between Vancouver and Kamloops. If you are interested in this class
of highly available system, Brendan and Ryan are both great people to
chat to for ideas and to find out about the code.
There are a bunch of interesting things to be done on Remus:
Disaster Tolerant Computing. Ryan's current work on Remus
could quite easily be turned into a number of Master's-thesis-level
projects. We're trying to make Remus run well when the pair of physical
machines is in geographically isolated locations, hundreds of
kilometers apart. We have built a testbed of two machines, one in
Vancouver and one at TRU in Kamloops, and are able to fail over VMs
between them. However, there are a number of challenges to solve
here:
- Compression: Long-distance links are very expensive,
and so it's important to reduce the amount of replication traffic that
is sent between them. A project here would explore how we can
compress pages being sent, identify pages that don't need to be sent
at all, and amortize the cost of replication across a number of VMs
(a small sketch of the hash-and-compress idea follows this list).
If you are interested in this area, you may want to look at things
like the TRAP-Array paper from ISCA, LBFS, and the recent UCSD
Difference Engine work at OSDI 2008.
- Traffic redirection: In the original Remus work,
failover was between two hosts on the same subnet. The only
network-layer work required to fail over is to send an unsolicited ARP
advertisement to notify the upstream switch that the VM has moved. In
the wide-area case, we must do a great deal more work: the IP address
must move, and BGP updates must be sent so that Internet traffic is
redirected appropriately. The challenge here is to understand how
quickly updates can be made and propagated, and how well this works
with various BGP and dual-homing configurations.
- Failure Detection: Failure detection in the original
Remus used a pair of redundant network connections. This is not
practical in the wide area, given that long-distance connections are
very expensive, and ensuring that they are physically isolated (from
backhoes, for instance) is quite hard. We need to develop a
distributed, Internet-based failure detection system that allows
external hosts to help the two protected sites decide when to
fail over (the second sketch after this list shows one way to think
about it). There are some challenging problems in this space around
heartbeating and distributed failure detection.
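To make the compression idea a little more concrete, here is a minimal Python
sketch of an LBFS-ish hash-and-compress pass over a checkpoint's dirty pages:
hash each page, send only a short reference if the backup has already seen
identical content, and compress whatever does get sent. None of this is real
Remus code; send_dirty_pages and the transport callback are made up for
illustration.

    import hashlib
    import zlib

    PAGE_SIZE = 4096

    class ReplicationChannel:
        """Illustrative sender-side cache of page hashes the backup already holds."""

        def __init__(self, send_fn):
            self.seen = set()          # hashes of pages the backup is known to have
            self.send = send_fn        # hypothetical transport: send(kind, payload)

        def send_dirty_pages(self, dirty_pages):
            """dirty_pages: iterable of (page_frame_number, bytes) for this epoch."""
            for pfn, data in dirty_pages:
                digest = hashlib.sha1(data).digest()
                if digest in self.seen:
                    # Backup already has identical content; send only a reference.
                    self.send("ref", pfn.to_bytes(8, "little") + digest)
                else:
                    self.seen.add(digest)
                    self.send("page", pfn.to_bytes(8, "little") + zlib.compress(data))

The same structure extends fairly naturally to amortizing across VMs: the set of
seen hashes just becomes shared between the replication channels of all the VMs
on a host.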
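On the failure-detection side, one way to frame the problem is to let a handful
of witness hosts scattered around the Internet vote on whether a protected site
has actually disappeared, so that the backup only promotes itself when a
majority agrees. The toy sketch below shows the shape of that decision; the
witness hosts, the wire protocol, and the function names are all invented.

    import socket

    # Hypothetical witness hosts spread around the Internet.
    WITNESSES = [("witness1.example.org", 7000),
                 ("witness2.example.org", 7000),
                 ("witness3.example.org", 7000)]

    def peer_reported_down(witness, peer_id, timeout=2.0):
        """Ask one witness whether it has heard a recent heartbeat from the peer site."""
        try:
            with socket.create_connection(witness, timeout=timeout) as s:
                s.sendall(b"ALIVE? " + peer_id + b"\n")
                return s.recv(64).strip() == b"NO"
        except OSError:
            return False   # an unreachable witness never counts as evidence of failure

    def should_failover(peer_id=b"site-vancouver"):
        """Promote the backup only if a majority of witnesses agree the peer is gone."""
        votes = sum(peer_reported_down(w, peer_id) for w in WITNESSES)
        return votes > len(WITNESSES) // 2

The interesting research questions start exactly where this sketch stops: how
witnesses gather heartbeats, how you avoid split-brain when the two sites
disagree, and how long all of this takes in practice.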
Better Mousetraps. Remus's big benefit is that it doesn't
require that applications or OSes be changed in order to support high
availability. Its biggest weakness is that the transparency
currently comes at a high cost: we copy lots of data, and incur a
large overhead for output commit to prevent people from seeing
speculative state. Christopher Head, an undergrad research assistant
in the lab, has been looking into ways to fix these issues:
- Exposing Remus to applications: One option is to let
applications know that they are running on top of a replication layer,
and let them elect not to protect certain pages in memory (a
hypothetical hint interface is sketched after this list). This has
the potential to be very useful for databases (which manage their own
memory and generally use it for caching) and may also be useful for
garbage-collected runtimes.
- Reducing network delay: Remus doesn't send any
traffic out to the network until it knows that the machine state
responsible for that traffic has been checkpointed. This introduces
delays of between 20 and 200ms on RTTs and really messes with
latency-sensitive things like talking to network-attached storage.
One way to fix this might be to build some protocol-specific state
trackers, and replicate undo logs between hosts at a higher frequency
than the rest of the system.
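To give a flavour of what exposing Remus to applications could look like, here
is a hypothetical hint interface: the application marks address ranges (say, a
database buffer cache) as not worth replicating, and the checkpoint path
filters those pages out of each epoch. The remus_unprotect call and everything
around it is invented for illustration; nothing like it exists in the current
code.

    import mmap

    PAGE_SIZE = mmap.PAGESIZE

    # Hypothetical registry of address ranges the guest has asked us not to
    # replicate, e.g. a buffer cache that can be rebuilt from disk after failover.
    _unprotected = []

    def remus_unprotect(addr, length):
        """Hypothetical hint call: exclude [addr, addr+length) from checkpoint traffic."""
        start = addr - (addr % PAGE_SIZE)
        _unprotected.append((start, addr + length))

    def pages_to_replicate(dirty_pages):
        """Filter an epoch's dirty pages, dropping any that fall in unprotected ranges."""
        for pfn, data in dirty_pages:
            addr = pfn * PAGE_SIZE
            if any(start <= addr < end for start, end in _unprotected):
                continue
            yield pfn, data

The hard part, of course, is not the filter but deciding what the contract with
the application should be: what state it promises it can reconstruct after a
failover, and how that promise is checked.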
Byzantine Remus. Well, okay, not Byzantine Remus -- who
would be crazy enough to do that -- but how about non-fail-stop HA?
The current system assumes that failure is absolute, and that it
comes from hardware. If you crash or have software problems, you are just
going to replicate those problems onto the other system and get
absolutely no help. It would be very interesting to think about
whether Remus can be used to help recover from software errors, by
rewinding and replaying systems that crash. People interested in this
area should look at the SOSP Rx paper, OSDI Failure-Oblivious
Computing, and maybe also Triage.
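As a very rough illustration of the rewind-and-replay direction (in the spirit
of Rx), the loop below rolls a failed VM back to its last checkpoint and
retries with the environment perturbed slightly, hoping the bug was timing- or
input-dependent. Every call on the vm object here is a placeholder, not an
existing interface.

    import random
    import time

    def recover(vm, max_retries=3):
        """Toy rewind-and-replay loop: restore the last good checkpoint and retry
        with a small perturbation, hoping a nondeterministic bug goes away."""
        for attempt in range(max_retries):
            checkpoint = vm.last_checkpoint()      # placeholder: most recent epoch
            vm.restore(checkpoint)                 # placeholder: roll state back
            vm.perturb(seed=random.random())       # placeholder: e.g. reorder message
                                                   # delivery, pad allocations
            vm.resume()
            if vm.wait_healthy(timeout=30):        # placeholder health check
                return True
            time.sleep(1)
        return False                               # give up and fail over for real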
This is work that I am currently doing with Brendan Cully, Geoffrey
Lefebvre, Dutch Meyer, Chris Head, Ryan O'Connor, Norm Hutchinson, and
Mike Feeley.
HindSight: Understanding what software is doing
We don't really know what applications are doing. We know what they
are supposed to do, and we have their source, which is a vague
codification of what they are supposed to do. Unfortunately, when
applications crash or are attacked, they generally don't do what they
are supposed to, and we are left staring into a smoking hole trying to
figure out what went wrong.
What if you had a detailed recording of everything
your applications and OS had done for weeks or months of execution?
What if you could issue queries against this trace data to understand
execution? What if these traces were so detailed that you could use
them to reconstruct a running version of the system at any point in
time, and resume execution in order to interact with it and ask
questions? How would you use this data to improve software
understanding, debugging, security, etcetera?
We have built an initial version of this system and are starting to
build tools on top of it. Areas of particular interest include:
- Helping developers understand and debug code.
Looking at source, we can use logging to identify code paths that
really ran in practice and ones that are dead. Debugging as a
query-driven process allows you to ask questions like: "What things
started here, and wound up over there?" (an example query sketch
follows this list).
- Building better test and exploit tools, because we
can regenerate system state and force it down different paths of
execution.
- Data mining for bugs. Performing control and data
flow analysis, I'm interested in discovering whether we can do outlier
detection to infer things like calls that should have had a lock held,
but didn't.
- Visualizing execution. These systems are incredibly
complex and contain millions of lines of code. There is an absolutely
enormous amount of data in these traces, and one of the biggest
challenges is how to present it in an understandable way to humans.
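As an example of what a query-driven debugging session over these traces might
look like, here is a sketch against an invented relational schema in which
every logged control-flow event is a row. The "started here, wound up over
there" question becomes a self-join; the table layout is an assumption for
illustration, not Hindsight's actual trace format.

    import sqlite3

    # Invented schema: events(thread_id, pc, timestamp), one row per logged
    # control-flow event.  The real trace format will differ.
    QUERY = """
    SELECT a.thread_id, a.timestamp AS entered_here, b.timestamp AS reached_there
    FROM events AS a
    JOIN events AS b
      ON a.thread_id = b.thread_id
    WHERE a.pc = :here          -- e.g. address of the code that "started here"
      AND b.pc = :there         -- e.g. address of the code it "wound up" in
      AND b.timestamp > a.timestamp
    ORDER BY a.timestamp;
    """

    def paths_from_to(db_path, here, there):
        """Which threads passed through `here` and later reached `there`?"""
        with sqlite3.connect(db_path) as db:
            return db.execute(QUERY, {"here": here, "there": there}).fetchall()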
This is work that I am currently doing with Geoffrey Lefebvre,
Dutch Meyer, Brendan Cully, Norm Hutchinson, and Mike Feeley.
Exceptions and Black Boxes
Virtual machines are really good at isolating software. This is why
there is a lot of fuss
about virtual
appliances as a means of distributing software. Because VMs let
you put software in a box, there are an increasing number of tools
that let you do security-related things to protect the box -- running
firewalls, intrusion detection, introspection, etcetera all from
outside a VM where the cooties that you download from the internet
can't interfere.
The problem with all of these things is that it is not at all clear
what you should do when you realize that something has gone wrong:
when you are looking at software from the outside, and you realize
that it is just about to run an infected binary, or a buffer overflow,
or some other variety of malaise, what can you do?
Existing systems use rather large hammers, like stopping the
machine, which turns bugs that might be used as a basis for an attack
into a guaranteed denial of service. The question is whether there are
techniques that might be developed to shepherd execution within the VM
away from problems, and allow execution to continue in a useful
way.
This project idea is with Bill Aiello.
Parallelism isn't enough.
Everyone knows that CPUs aren't getting faster as fast as they used
to. Instead of having moore cycles per core, we are getting moore
cores per socket. This is great for a few workloads: in particular,
massively parallelizable problems like string search, which is great
for the bioinformatics people, and media processing, which is good for
things like Photoshop and HD video. Unfortunately, most applications
aren't going to get any faster. Your OS is still going to take
minutes to boot, Firefox will still take forever to start, and it's
still going to take a while to find that document you lost.
This is because IO, not computation, is the real bottleneck in
common IT workloads, and parallelism, in its current form, really
isn't going to help.
So the first project here is to validate these claims. I'd like
someone to simulate an infinitely fast processor and tell me how much
faster it would make modern applications. How fast would Windows boot
if it was only doing IO? How much would SpecWeb scores
improve? This is a good candidate for a HotOS paper, and if you do it
really fast it can be submitted to HotOS for the January deadline.
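While waiting for a proper simulation, a crude Amdahl-style bound gives a feel
for the answer: if computation were free, a run would still take at least the
time it spent waiting on IO. The numbers below are placeholders to show the
arithmetic, not measurements of anything.

    def infinite_cpu_speedup(total_seconds, cpu_seconds):
        """Upper bound on speedup from an infinitely fast CPU: the run still takes
        at least the time it spent blocked on IO and other non-CPU waits."""
        io_seconds = total_seconds - cpu_seconds
        return total_seconds / io_seconds

    # Placeholder numbers: a 60-second boot that spends 18s on the CPU and 42s
    # waiting on disk speeds up by at most ~1.4x, no matter how many cores you add.
    print(infinite_cpu_speedup(total_seconds=60.0, cpu_seconds=18.0))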
The broader project here is to try to solve the problem. I believe
the solution is probably around trying to make better use of memory
and IO resources in order to very aggressively speculate, at a
whole-system level, ahead of the current state of the system. So the
question here is whether you can come up with useful ways to structure
an OS and applications that take existing code, and allow it to take
advantage of parallelism to go faster, at a system level.
An initial idea here is to use data dependency-based continuations
as the basis for representing
systems. Capriccio
did this in a way. At a high level, you arrange to structure the
system as a graph of blocking I/O states, and use that graph structure
to do things like speculative execution and prefetching.
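A toy version of that structure, just to show the shape of the idea: nodes are
blocking I/O states, and whenever execution enters a node we speculatively kick
off the I/O of its possible successors so the data is ready if control gets
there. This is purely illustrative; it is not Capriccio and not an existing
system.

    from concurrent.futures import ThreadPoolExecutor

    class IONode:
        """A blocking I/O state: one blocking operation plus the states that may follow it."""
        def __init__(self, name, io_fn):
            self.name = name
            self.io_fn = io_fn        # the blocking operation, e.g. a disk read
            self.successors = []      # IONodes that can run after this one

    def run_with_prefetch(start):
        """Walk the graph; on entering a node, speculatively start its successors'
        I/O so the data is (hopefully) already there when control reaches them."""
        pending, result = {}, None
        with ThreadPoolExecutor(max_workers=4) as pool:
            node = start
            while node is not None:
                for nxt in node.successors:                      # prefetch ahead
                    pending.setdefault(nxt.name, pool.submit(nxt.io_fn))
                future = pending.pop(node.name, None)
                result = future.result() if future else node.io_fn()
                # ...consume `result`, then pick the next state (toy: first successor)
                node = node.successors[0] if node.successors else None
        return result

The real research problem is everything this sketch punts on: discovering the
graph from existing code, deciding how far ahead it is safe and profitable to
speculate, and containing the side effects of work you end up throwing away.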