A)
In the MMCN paper, we had the experiment where we ran N videos in a
single QStream process. We can look at values of N in the range where
the CPU never becomes saturated. For each case, we can compute the average
CPU requirement per video (total CPU time / N). If this average
rises, we might argue that the increase represents the overhead of
"application-level switching" (i.e. the event loop). I expect
this overhead to be low or negligible, but you never know.
Now we look at the data as N increases well into the range of CPU
saturation. Based on the assumption that user-level switching is
significantly cheaper than kernel context-switching, these numbers
give us a kind of best-case bound on kernel-scheduling performance,
both in terms of throughput and responsiveness. We measure
throughput by the overall average frame rate. We measure
responsiveness directly for every deadline in the application (this
is already implemented): the difference between the actual service
time and the deadline time originally requested.
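To be concrete about the metric, here is a minimal sketch of the
per-deadline lateness computation (the function name is hypothetical;
QStream's existing instrumentation records the equivalent):

    #include <time.h>

    /* Lateness of one event: actual service time minus the deadline
     * time originally requested.  Positive values mean we were late. */
    static double lateness_ms(const struct timespec *deadline)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);   /* actual service time */
        return (now.tv_sec  - deadline->tv_sec)  * 1e3 +
               (now.tv_nsec - deadline->tv_nsec) / 1e6;
    }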
B)
Then we have worst-case baseline. We run N QStream processes,
playing 1 video each, with the stock Linux scheduler. We have done
these experiments, but they were not part of the MMCN paper and our
analysis of the results is ongoing. What I expect here is a
small, perhaps even negligible, impact on throughput, but serious
degradation of responsiveness as N drives into the CPU saturation
zone. We definitely see the latter, but we need to go over the data
to verify the former. This is the worst case in that it has the
worst imbalance of throughput over responsiveness.
Notice that there may be cases of kernel scheduler pre-emption here,
but I think they will be relatively infrequent (possibly negligible)
because of the kernel's conservative default timeslice parameters. I
think we should be able to use existing Linux tracing tools, or
perhaps do some simple instrumentation of our own in the kernel, to
verify the context-switch behaviour. In fact, I think it will be
valuable to measure the number of context switches, and the number of
pre-emptive context switches, for all of these experiments. We should
measure the number of system calls too.
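As a low-cost starting point, before we touch the kernel: getrusage()
already distinguishes voluntary from involuntary (pre-emptive) context
switches, so each QStream process can report its own counts at the end
of a run, and strace -c can give per-process system-call counts (at the
price of some probe overhead). A sketch:

    #include <stdio.h>
    #include <sys/resource.h>

    /* Report voluntary vs. pre-emptive (involuntary) context-switch
     * counts for this process; call once at the end of an experiment. */
    static void report_ctxt_switches(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("voluntary: %ld  pre-emptive: %ld\n",
               ru.ru_nvcsw, ru.ru_nivcsw);
    }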
Abishek, perhaps setting up this instrumentation will be a good task
for you. You can have Ani show you how to run experiments A and B.
Then modify them to include the above measurements.
C)
Next, we have the Linux Real-Time + application-level EDF. If this
works as planned, I expect to see significant improvement in
responsiveness, but a decrease in throughput. As you point out, we
will need to be careful to ensure it is work-conserving. This is
pretty easy to check for high values of N, since we can verify that
CPU load is 100%. For lower values of N, we simply verify that the
application is doing the same work (decode + display the same number
of frames) as in A and B. My conjecture is that this experiment
should be fully co-operative; i.e. the kernel should never actually
pre-empt any of the processes, and all context switches should be the
result of voluntary sleeping. So relative to B), we should
see a marked increase in the number of context switches, an equal rate
of system calls, and still a negligible number of pre-emptions.
This case marks the logical extreme of responsiveness over throughput.
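A minimal sketch of the intended mechanism, assuming SCHED_FIFO plus
absolute-time sleeps (function names are hypothetical):

    #include <sched.h>
    #include <time.h>

    /* Enter a real-time priority class (requires root).  Under
     * SCHED_FIFO, we keep the CPU until we voluntarily sleep. */
    static void enter_realtime(void)
    {
        struct sched_param sp = { .sched_priority = 1 };
        sched_setscheduler(0, SCHED_FIFO, &sp);
    }

    /* Yield co-operatively: sleep until our next deadline, given as
     * an absolute time so there is no drift.  If every process does
     * this, all context switches should be voluntary and the kernel
     * should never need to pre-empt. */
    static void yield_until(const struct timespec *next_deadline)
    {
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                        next_deadline, NULL);
    }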
D)
We do Linux Real-Time + application-level periodic scheduling. I
only just realized from our discussion that doing this probably makes
sense, maybe more than EDF. Implementing this should be a simple
modification of the application-level yielding mechanism. Now we
just round-robin every X milliseconds. So instead of sleeping until
our next deadline, we sleep until our next period. By increasing
the value of X, we should be able to measure the trade-off between
responsiveness and throughput. Again, there should be no
pre-emptions. The context-switch rate should be constant within each
experiment, i.e. a function of N and X alone. For each X,
responsiveness should be bounded by an amount proportional to X, and
average throughput should improve as X increases.
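Relative to C), the change is one line of policy: round up to the next
period boundary instead of to the next deadline. A sketch, with X in
milliseconds and a shared epoch so all processes agree on where the
period boundaries fall:

    #include <time.h>

    /* Sleep until the next multiple of X ms past a common epoch,
     * instead of until our next deadline. */
    static void yield_until_next_period(long x_ms,
                                        const struct timespec *epoch)
    {
        struct timespec now, next;
        long long elapsed_ms, slot;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_ms = (now.tv_sec - epoch->tv_sec) * 1000LL +
                     (now.tv_nsec - epoch->tv_nsec) / 1000000LL;
        slot = elapsed_ms / x_ms + 1;        /* next period boundary */

        next = *epoch;
        next.tv_sec  += (slot * x_ms) / 1000;
        next.tv_nsec += ((slot * x_ms) % 1000) * 1000000L;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec++;
            next.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }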
This represents a range of throughput-responsiveness points. At one
end, as X gets small, responsiveness will reach a floor comparable to
the level of C) (although throughput will continue to decrease). At
the other end, as X gets large, throughput will plateau, probably
comparable to B).
In C and D, the accuracy of the kernel timer mechanism will matter
(i.e. the default 4ms tick vs a 1ms tick vs high-resolution timers).
I don't think it matters much for A or B.
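A quick way to check what resolution we actually get on a given kernel
build is to request a short sleep and measure the overshoot; a sketch:

    #include <stdio.h>
    #include <time.h>

    /* Request a 1 ms sleep and report what we actually got: expect
     * roughly 4 ms on a HZ=250 kernel, roughly 1 ms on HZ=1000, and
     * close to the request with high-resolution timers. */
    int main(void)
    {
        struct timespec t0, t1, req = { 0, 1000000L };   /* 1 ms */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL); /* relative */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("slept %.3f ms\n",
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
    }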
E)
We modify Linux to have short timeslices [basically doing kernel
periodic scheduling for all processes]. We run QStream without
Real-Time or cross process co-operation. This provides our first
opportunity to measure the cost of pre-emptive context switches, vs
plain context switches. We should be able to observe the same
type of trends as D, but there should be some throughput skew relative
to D because now almost all context switches should be pre-emptive.
If there is no skew, then this would invalidate the conjecture that
voluntary switching is significantly better than involuntary
switching.
It also tells us something about the cost of policing.
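In the 2.6 series the timeslice lengths are compile-time constants in
kernel/sched.c, so shrinking them is probably the simplest way to get
this effect. A sketch of the change (macro names are from recent 2.6
sources and may vary across versions):

    /* kernel/sched.c -- stock 2.6 values are roughly:
     *   #define MIN_TIMESLICE   max(5 * HZ / 1000, 1)
     *   #define DEF_TIMESLICE   (100 * HZ / 1000)
     * Shrinking the default, e.g. to 5 ms, approximates kernel-level
     * periodic scheduling for all processes: */
    #define MIN_TIMESLICE   max(1 * HZ / 1000, 1)
    #define DEF_TIMESLICE   (5 * HZ / 1000)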
If analysis of C), D), and E) justifies it, we may also try this:
F) A hybrid between C and D. Can we improve by using deadlines
and periods together? The idea here would be to do some minimum
batching of deadlines inside QStream (to reduce the context-switch
rate relative to C), but make yields more strategic than D) because we
still take some advantage of knowing others' deadlines. This
modifies C) because a process does not necessarily advertise its next
deadline when it yields; instead it advertises the first deadline
that falls into the next period (the X-millisecond window), or after.
So throughput should be equal to D), but maybe responsiveness will
improve (?). I think this will only help if the ratio of deadlines to
periods is small. Perhaps we will think of some more effective ideas
for a hybrid approach.
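A sketch of the modified advertisement rule, assuming a sorted
per-process deadline queue (helper names are hypothetical):

    #include <stddef.h>
    #include <time.h>

    /* Hybrid rule: instead of advertising our earliest deadline,
     * advertise the first deadline at or after the next period
     * boundary, batching everything inside the current X ms window. */
    static const struct timespec *
    advertise_deadline(const struct timespec *deadlines, size_t n,
                       const struct timespec *next_period)
    {
        size_t i;
        for (i = 0; i < n; i++) {    /* deadlines[] sorted ascending */
            const struct timespec *d = &deadlines[i];
            if (d->tv_sec > next_period->tv_sec ||
                (d->tv_sec == next_period->tv_sec &&
                 d->tv_nsec >= next_period->tv_nsec))
                return d;
        }
        return NULL;  /* all deadlines fall inside the current window */
    }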
--
AnirbanSinha - 22 Aug 2006