Hi and welcome! :)

This page relates to my research. The following is a log of the activities completed, arranged chronologically by date.

Benchmarks conducted for the MMCN paper (submitted on 1st August 2006; acceptance notification due 1st September 2006) - paper accepted at MMCN 2007 :)

Most of the work reported here was completed before Aug 1st 2006

ubc-mmcn07.pdf: MMCN Final Accepted paper

  • We instrumented both mplayer and VLC to measure jitter, FPS and CPU-load related parameters. In particular, we modified the video display modules by placing printf()'s at appropriate places to print out the current time and the frame jitter at that time. From this basic information, we later calculate FPS by running another script on the dump files. In all the experiments, we make sure that mplayer and VLC are allowed to drop frames when needed. (A minimal sketch of this kind of instrumentation appears after this list.)
    • If X instances of mplayer and Y instances of VLC just saturate the CPU playing a specific XviD (MPEG-4) video (all of them play the same video at the same position), we measure their individual FPS and frame jitter. At the same time, we measure the overall CPU load on the system for the duration of playback. We run these experiments on a Dell Inspiron 1300 laptop.
    • Next, we perform the same experiment with 2X mplayer instances and 2Y VLC instances and measure the same parameters, again on the Dell Inspiron 1300 laptop.
    • From the above two experiments, we show that VLC performs the worse of the two under heavy CPU load.
    • Next, we patch the instrumented mplayer to play SPEG video. We perform the same experiment as above and note the new values for X and 2X. Let the new value of X be X'; we note that X' < X. We run these experiments on a P-IV 3 GHz desktop with 1 GB of RAM.
  • We perform the same experiments using QStream, this time with a single QStream process playing multiple videos.
    • We used Qmon to dump the performance data to text files and plot it. We note the threshold number of videos that just saturates the CPU, say Z. We run an increasing number of videos up to Z to check how QStream adapts to increasing load, and then continue up to 2Z. We run the QStream client on a P-IV 3 GHz desktop with 1 GB of RAM and 100 Mbps Ethernet; the server and monitor run on a different machine on the same subnet. Later we plot the FPS, CPU-load and jitter values for each of the videos.
    • We run another set of similar experiments, this time boosting the temporal quality of one video (meaning it drops fewer frames) relative to the rest. We show that when all videos have equal priorities, all of them degrade gracefully, whereas in the boosted case one of them definitely performs better than the rest.
    • Lastly, we run a single video with QStream and plot the video bitrate against CPU load, showing that the bitrate of an MPEG-4 video varies greatly over time and that the CPU load varies accordingly, i.e. the two are strongly correlated. This clearly shows that the entropy of a video varies greatly over time, and so does the number of bits used to encode a single frame. Thus, when we play multiple videos on a single machine, the CPU requirement is the combined requirement of all the videos taken together, which is very difficult to predict, much less provision for beforehand. This supports our belief that graceful quality adaptation across multiple videos is a difficult problem, exacerbated by the fact that hardware capability varies greatly from device to device, from handhelds with low processing power (to save battery life) to high-end dual-core desktop machines.
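
A minimal sketch of the kind of printf() instrumentation described above (the function and variable names are illustrative assumptions, not the actual mplayer/VLC hook points):

    /* Hypothetical timing probe in the spirit of the printf()'s added to the
     * video display modules; "target_pts" stands in for whatever presentation
     * timestamp the real player keeps for the frame being displayed. */
    #include <stdio.h>
    #include <sys/time.h>

    static void log_frame_timing(double target_pts)
    {
        struct timeval tv;
        double now, jitter;

        gettimeofday(&tv, NULL);
        now = tv.tv_sec + tv.tv_usec / 1e6;   /* wall-clock time in seconds */
        jitter = now - target_pts;            /* how late (or early) the frame is */

        /* One line per displayed frame; a post-processing script can later
         * count lines per second to recover FPS and aggregate the jitter. */
        printf("FRAME %.6f %.6f\n", now, jitter);
    }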

Experiments performed as a continuation of the ones for the MMCN paper but not reported in the submitted paper (they will perhaps be reported in the camera-ready version once the paper is accepted).

The work reported here was completed before Aug 15th 2006

  • QStream tests as before, but this time we run N instances of QStream, each playing one video.
    • We do the same set of tests as before, but now each QStream process plays one video, and we skip each video by a different amount so as to create a heterogeneous environment in which each video has varying CPU requirements. We disable ALSA sound output because we find that contention among the different QStream processes for the single sound device causes major stalls in the videos. We compare the results with those of the previous benchmarks and find that:
      • Boosting has no effect - understandable, because the videos do not communicate their deadlines to each other, and the scheduler knows nothing about the boosted status of one video.
      • Unfairness among the videos is more pronounced, since the scheduler knows nothing about the players' deadlines.
    • We intend to perform the same set of experiments with mplayer with sound disabled, for a fairer comparison in the final version of the paper.

We performed some minor bug fixes in QStream as we went along, and modified the decoder for more fine-grained frame-level CPU adaptation (frame dropping etc.).

Readings completed (as of 31/08/2006)

  • Major portions of Buck's thesis. The chapters on performance analysis and multicast were skipped.
  • Read the rejected NOSSDAV paper and the soft timers paper.
  • Read the QStream qsf source code and the shared memory IPC code.
  • Preliminary coding of the cooperative scheduler done; benchmarks done.

Work done as part of the Eurosys paper (initial submission: 20th September 2006, 10 pm; paper rejected from Eurosys 2007 :()

Most of this work was done during the period September 1st 2006 to September 20th 2006

eurosys07-submit.pdf: Eurosys submitted paper

The EDF scheduler

  • Our proposal is that each real-time application acts in a more cooperative fashion through a combination of voluntary yielding and providing explicit information about its timing requirements. We call this approach cooperative polling. The approach is based on a reactive, event-based programming regime. Using such a fine-grained approach, each application should be able to hit its deadlines at the right time; if a deadline is ever missed, we still have a bounded latency.
  • However, there are caveats in this approach. Firstly, we incur a large number of context switches, since this is a very fine-grained design. The number and frequency of context switches depend on how close together the events are on the timeline, the time to execute an event and the context switch time. A high context switch overhead translates into inefficient use of the CPU, which could otherwise have been used for other useful work. Secondly, what happens if one or more of the applications lies about its internal deadlines and other private parameters? In such a case, one application can easily steal CPU time from the others, and a traditional kernel-level CPU scheduler has no knowledge of higher-level, process-specific notions like deadlines and priority. To deal with this situation, we propose to design a hybrid scheduler, one that sits between a fine-grained cooperative scheduler and a simple periodic scheduler. In this scheme, we choose an appropriate period, say 20 ms, and divide it uniformly between all the competing real-time tasks, so that each process starts with a uniform allocation. The kernel wakes up the process with the nearest deadline (the highest priority) and runs it until the deadline of some other process is about to expire; the kernel then wakes up that process and runs it. As an application runs, it uses up its own allocation. When its allocation reaches zero, the kernel forces it to sleep until the next epoch, i.e. after 20 ms, the period of our scheduler. (A simplified sketch of this decision rule appears after this list.) In this scheme, each application is still able to hit its deadlines, but with a maximum tardiness equal to the period of our scheduler: if the allocation of a process expires it will sleep, and one of its own deadlines may have expired at the very moment it went to sleep; that deadline event cannot be served by the CPU until the process wakes up again, in this case after 20 ms. On the other hand, the benefit of this approach is that no one gains by lying to the kernel: whoever lies uses up their allocation earlier than the others and is forced to sleep. Thus, by adjusting the period, we can tune the maximum tardiness we tolerate, obtain a policing effect and also achieve an upper bound on the number of context switches. Reducing the number of context switches and policing come at the price of more coarse-grained context switching and hence increased tardiness. The main theme of this mixed approach is to share the CPU reasonably uniformly across all the processes, even if one of them cheats the kernel, while still hitting deadlines within reasonable bounds.
  • However, the above discussion assumed that no other CPU-intensive, non-real-time tasks are running in the system. In general, this assumption does not hold: when there are other CPU jobs in the queue, the CPU division across the real-time tasks may not be uniform. To account for the fact that there might be other non-real-time, throughput-oriented applications in the system that also need a share of the CPU, we propose to reserve a fixed percentage of CPU time (say 20%) that can be allocated to these other tasks. Though this might look like an easy solution to the problem, it is not very elegant. The obvious questions we find ourselves asking are: how much CPU do we want to reserve beforehand? Is this reservation enough at all times? Does it ever lead to wasted CPU cycles when few other non-real-time tasks are running? We are in the process of trying to find more elegant and feasible solutions to this problem. In general, we believe this is an extremely difficult problem to address.
  • To get an idea of the different kinds of issues we might encounter in our design and the kind of results we may expect, we have a user-level implementation of the pure cooperative scheme, the hybrid scheme and the pure periodic scheme using QStream. Each QStream application communicates its deadlines and deadline priorities to the others using shared memory IPC. All of these applications yield the CPU (using a condition variable and wait) to their fellow applications depending on their own deadlines and the nearest external deadlines. The interesting part is that each application explicitly wakes up the next application before going to sleep using this condition variable. Thus the applications are tightly bound to each other; the degree of coupling depends on the scheduling algorithm used, varying from tight to loose as we go from pure cooperative to pure periodic.
  • We have performed benchmarks using QStream processes under all three scheduling mechanisms discussed above and compared them with the single-process multi-video and independent multi-video cases.
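
A simplified, user-level sketch of the hybrid decision rule described above (the types, names and fixed task count are illustrative assumptions; the real implementation is event-driven and considerably more involved):

    /* Hypothetical hybrid EDF/periodic dispatcher. PERIOD_US is the
     * scheduling epoch (e.g. 20 ms), divided uniformly among the tasks. */
    #define PERIOD_US  20000
    #define NTASKS     4

    struct task {
        long long next_deadline_us;   /* earliest pending deadline */
        long long allocation_us;      /* CPU budget left in this epoch */
    };

    /* Pick the next task to run: earliest deadline among tasks that still
     * have budget; a task whose budget is exhausted sleeps until refill. */
    static int pick_next(struct task t[NTASKS])
    {
        int best = -1;
        for (int i = 0; i < NTASKS; i++) {
            if (t[i].allocation_us <= 0)
                continue;             /* forced to sleep until the next epoch */
            if (best < 0 || t[i].next_deadline_us < t[best].next_deadline_us)
                best = i;
        }
        return best;                  /* -1 means idle until the epoch refill */
    }

    /* At every epoch boundary each task's budget is refilled uniformly,
     * so a task that lied and overran simply runs out of budget earlier. */
    static void refill(struct task t[NTASKS])
    {
        for (int i = 0; i < NTASKS; i++)
            t[i].allocation_us = PERIOD_US / NTASKS;
    }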

The Results and Analysis

  • The hybrid and periodic schedulers simply do not give proper results. The tardiness and FPS values for both of these schemes are quite far from theoretical expectations.
  • A cooperative EDF scheduler produces results very close to the single video case. However, we see an overall context switch rate of more than 3000 per second.
  • The FPS rates of all the videos are very close to each other in the pure cooperative case, something that is impossible to achieve in the independent-player case.
  • The most significant observation in the pure cooperative case is that the FPS throughput is not significantly lower than in the other cases. This perhaps indicates that the context switch overhead is not large enough to affect throughput, which is one of the interesting takeaways from the benchmarks. However, the question of one application lying about its deadlines still remains in the cooperative case.
  • We expected maximum throughput in the pure periodic, 0% idle case. The context switch rate can also be expected to be drastically lower, proportional to the number of videos running but bounded by our scheduling period of 10 or 20 ms: if the number of videos is N, we have N context switches per 10 or 20 ms. This number becomes unbounded (limited only by the duration between two deadlines) in the pure cooperative case. In the pure cooperative case, the granularity of video deadlines (and hence of context switches) is related to the frame rate of the video, so a granularity of ~33 ms is pretty reasonable. If the video is a two-way conference, then the audio can actually impose a tighter constraint than the video, because one will want to minimize the buffering delay of the audio device; in this case a granularity of more like 10 ms is required. The total deadline granularity for QStream would be more in the ballpark of 1-5 ms (i.e. 200-1000 deadlines per second). So, if we have 10 of these running, we are talking somewhere between 2000 (10 x 200) and 10000 (10 x 1000) context switches per second with EDF. The actual numbers were in the ballpark of 3000 yields per second.
  • Tardiness is expected to increase (i.e. latency increases) as we go from the pure cooperative case to the pure periodic case. The maximum latency is expected to be bounded by the scheduler period, which is indeed the case.

Future Directions (Possible things that I will look into implementing in my thesis)

  • Modifying the Linux scheduler (simply by reducing timeslices) may be an easy way to build a macro-benchmark that measures the overhead of varying rates of context switching. The results should quantify the trade-off between responsiveness and throughput, which in turn should provide important guidance in designing meaningful scheduling heuristics.
  • By varying the scheduling period X, we may get a scenario that represents a range of throughput-responsiveness points. At one end, as X gets small, responsiveness will reach a floor comparable to the level of C) (although throughput will continue to decrease). At the other end, as X gets large, throughput will plateau, probably at a level comparable to B). It might be interesting to generate this plot.
  • Need to quantitatively measure voluntary yields vs. preemption overheads.
  • Regenerate graphs for VLC and MPlayer on the desktop machine with 4 and 8 videos. Right now, these experiments were done on a laptop and the comparisons seem suspect.

  • Complete implementation of the periodic and/or BVT scheduler. This would allow fair sharing of resources rather than fair sharing of frame rates. It might also help show the trade-off between overhead and dispatch latency.

  • Implement coop_poll in the kernel. I think it would be really useful to get some experience implementing this call in the kernel. In particular, the policing part will give us an idea of how often cooperatively polled threads are preempted in practice. Next, we could use some kind of feedback to add a little yield tolerance; for example, if interrupts are being generated rapidly, this tolerance could be made higher to deal with time lost to interrupt handling. I think Luca/John Regehr had some work of this kind; Paul (Jon's student) mentioned it as well. Initially, coop_poll could be implemented starting from the sched_yield() call. (A purely speculative interface sketch follows below.)
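
A purely speculative sketch of what a coop_poll interface could look like; the struct, field names and semantics here are assumptions for illustration, not the actual implementation:

    #include <time.h>

    /* The caller reports its own next deadline and, on return, learns the
     * earliest deadline among the other cooperating tasks. Conceptually it
     * behaves like sched_yield() plus this deadline exchange. */
    struct coop_param {
        struct timespec my_deadline;    /* in:  caller's next internal deadline */
        struct timespec next_deadline;  /* out: nearest deadline elsewhere      */
    };

    long coop_poll(struct coop_param *p);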

  • With the coop_poll implementation, it would be possible to deal with regular best-effort processes. This would be the third category of jobs. Here, all kinds of scheduling policies such as proportions could be used.

  • Incorporate cooperative scheduling into the X server. This seems like an interesting problem because I think there is not a lot of work on the interaction between time-sensitive and server applications.
  • Extension of the model to multi-processor scenarios. One possible approach is the Zeldovich et al. (incl. Robert Morris) work on multiprocessor support with events, using the colored-event concept. Even crazier: extend the model to distributed systems; that needs an efficient distributed priority queue - think DHT, only harder.
  • Unify coop_poll and regular poll(). Extend struct pollfd to include coop_poll events for POLLIN and POLLOUT, so that the scheduling order of monolithic and cooperative execution is closer to the same, and so that overall reactiveness is more controllable in the cooperative case (i.e. so that a group of cooperative processes will react more quickly when one of them has audio device work to do). (A speculative sketch follows below.)
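
One speculative way the unification could look; this is an assumption about a possible interface, not an existing kernel API:

    #include <poll.h>
    #include <time.h>

    /* Pair ordinary POLLIN / POLLOUT interest with a cooperative deadline,
     * so that a single call can express both I/O readiness and timing. */
    struct coop_pollfd {
        struct pollfd   pfd;       /* ordinary file descriptor interest */
        struct timespec deadline;  /* cooperative deadline tied to this fd */
    };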

  • User-level lazy AIO. The idea: a way to turn calls to legacy library code (e.g. Berkeley DB) from a synchronous API into an asynchronous one without rewriting their implementation. (A rough sketch follows below.)
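
A rough sketch of the idea, assuming the blocking legacy call is offloaded to a helper thread that signals completion back to the event loop through a pipe (legacy_blocking_get() and the other names are placeholders):

    #include <pthread.h>
    #include <unistd.h>

    /* Stand-in for a synchronous legacy call, e.g. a Berkeley DB lookup. */
    extern int legacy_blocking_get(const char *key, char *out, int outlen);

    struct aio_req {
        const char *key;
        char        result[256];
        int         rc;
        int         notify_fd;   /* write end of a pipe watched by poll() */
    };

    static void *aio_worker(void *arg)
    {
        struct aio_req *req = arg;
        req->rc = legacy_blocking_get(req->key, req->result,
                                      sizeof(req->result));
        (void)write(req->notify_fd, "x", 1);   /* wake the event loop */
        return NULL;
    }

    /* Submit: returns immediately; the event loop later sees notify_fd become
     * readable and collects req->result without ever blocking itself. */
    static int aio_submit(struct aio_req *req)
    {
        pthread_t tid;
        int rc = pthread_create(&tid, NULL, aio_worker, req);
        if (rc == 0)
            pthread_detach(tid);
        return rc;
    }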

Work Done in the month of October 2006

First two weeks

  • Worked with Bossa, trying to figure out whether it could be used to build an initial prototype of the kernel coop_poll implementation.
  • At the same time, read basic kernel code alongside Robert Love's book. Covered chapters 1 through 10 (everything except the file-system related chapters).

Last two weeks and the first week of November

  • Gave up on Bossa. Will use Xen instead as a tool for kernel development.
  • Finished setting up Subversion repositories for source version control. Also set up, and got used to, tools like lxr.
  • Set up Xen.
  • Initial port of the user-level heap library into the kernel. Successful kernel build.
  • More reading of the actual kernel code, via the O'Reilly book and lxr.

Work Done in the month of November 2006.

Second week (November 5 - November 8)

  • Attended OSDI Conference, Seattle, WA.

Last three weeks

  • Coding of coop_poll implementation in the kernel.
  • Coding of user-level modifications to qsf. Added an extra command-line switch to qsf to use the coop_poll call instead of shared memory.
  • Implemented a kernel compile menu option for coop_poll. Also added a coop stats feature to the kernel.
  • More readings.

Work Done in the month of December 2006.

  • Debugged and completed implementation of coop_poll in the kernel.
  • Updated benchmark scripts to include bvt and coop_poll options.
  • Took benchmarks for coop_poll implementation.

Work Done in the month of January 2007.

  • More benchmarks done with BVT and coop_poll for a deeper understanding of the BVT algorithm and its relationship with coop_poll.
  • More fine-tuning of the BVT algorithm and a redesign to remove bugs (mostly done by Buck :))
  • More readings: kernel memory management.
  • Started coding the BVT algorithm into the kernel. Work incomplete.

Last week of January (January 30th to February 2nd)

  • Attended MMCN Conference, San Jose, California.

Work Done in the month of February 2007.

  • No research. Interview preparation

Work Done in the month of March 2007.

First three weeks of March 2007 (till March 23rd)

  • Continuation of BVT coding.
  • Debugging of BVT. Hunted down a weird bug - it turned out to be due to an incomplete kernel build and an inconsistent ABI view across the kernel drivers.
  • Benchmarks taken - kernel BVT + coop_poll. These were inappropriate benchmark tests; it was later revealed that BVT was not at all doing what it was supposed to do.

Last week of March 2007 & First week of April 2007.

  • Some critical bugs found in the code. Bugs fixed and more benchmarks taken.
  • BVT seems to perform well, but is limited by the jiffy granularity of context switches.

Work Done in the month of April 2007.

Second & Third week of April 2007.

  • High-resolution timers introduced into the BVT code.
  • Old jiffy-level code unified with the high-res code through a compile-time option.
  • Thesis code ported to the 2.6.20 kernel tree and brought under version control.
  • Thesis code ported to the 2.6.21 kernel tree and brought under version control.
  • The new, completely fair-share heuristic (no context switch until the timer fires) unified with the old buggy heuristic by a compile-time option (see CONFIG_BVT_PURE_FAIRSHARE; a rough illustration follows after this list).
  • A whole new tracing subsystem introduced.
  • Working fair-share scheduler implemented - all benchmarks complete. Analysis complete.
  • Code ported back to the old 2.6.16 kernel. 2.6.16 with high-res patches and the updated codebase seems to be working and producing results similar to the 2.6.21 (or .20) kernel :)
  • We have designed a new benchmark component: tests with X disabled. We saw that the results with no X are far more predictable in terms of timeliness than the results with X enabled. There therefore seems to be major scope for work in hacking the X server! This is a significant result.
  • We have also seen other interesting results in terms of cache pollution and context switches.
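
Purely for illustration, the compile-time unification mentioned above could look something like the following; CONFIG_BVT_PURE_FAIRSHARE is the option name from this log, but the surrounding function names are placeholders, not the actual thesis code:

    /* Select between the two dispatch heuristics at build time. */
    #ifdef CONFIG_BVT_PURE_FAIRSHARE
    /* New heuristic: no context switch until the high-res timer fires. */
    #define bvt_pick_next(rq)  bvt_pick_next_fairshare(rq)
    #else
    /* Old jiffy-granularity heuristic, kept for comparison. */
    #define bvt_pick_next(rq)  bvt_pick_next_legacy(rq)
    #endif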

Work during the month of May 2007.

  • Work started on integrating coop_poll with fair-share scheduling. This is the essence of policing and the crux of my thesis. It is also the last and final bit of challenge left in my work :)
  • I will not look into the last four issues in the "Future Directions" section (see above). Some other students will take them up. Once complete, we will shoot for a good conference!

Work during the month of June 2007.

  • Together we completed the policing implementation in the kernel (finally, sigh :()
  • Fresh set of benchmarks done to evaluate policing issues.
  • Code fine-tuning and cleanups. Port to the 2.6.21 kernel.
  • Benchmarking Ingo's CFS scheduler.

Work during the month of July 2007.

  • sosp-abstract-anirbans.pdf: Abstract submitted for SOSP Poster and WIP session
  • Benchmarks related to HD video and best-effort video encoding.
  • Made modifications to the existing algorithm to accommodate non-adaptive applications.
  • Benchmarks for the above completed. The results are inconclusive.
    • This remains future work. Mayukh will probably have a look at this multi-domain implementation.

Work during the month of Aug 2007.

  • No research work done. Visited home in India.
  • Some outlines of the thesis write-up done.

Work during the month of Sept 2007.

  • Poster accepted at SOSP.
  • Thesis writing continues with full effort.

Work during the month of October 2007.

  • Thesis writing continues.
  • Poster designed. Work done related to the Demo.
  • Oct 14 - Oct 17: Attended SOSP at Stevenson, WA.
  • Thesis writeup given to Norm for second reading and corrections/suggestions.
  • Norm's corrections incorporated into the writeup.

November 2nd 2007

  • Presented my thesis work in the DSG seminar slot, 1 PM to 2 PM.
  • This was the final M.Sc. research presentation talk.
  • All systems faculty, adjunct faculty and many current and new students were present - roughly 20 people in all.
  • Handed in the final printed thesis to the Faculty of Graduate Studies (single-sided) and to Joyce (double-sided).
  • M.Sc. completed. fooh! ;)

Topic attachments

  • eurosys07-submit.pdf (382.7 K, 2006-09-24, AnirbanSinha) - Eurosys submitted paper
  • sosp-abstract-anirbans.pdf (78.7 K, 2007-07-20, AnirbanSinha) - abstract submitted for the SOSP poster and WIP session
  • ubc-mmcn07.pdf (282.3 K, 2006-09-24, AnirbanSinha) - MMCN final accepted paper