Research

My principal research interests are operating system scheduling for chip multithreaded processors and synergistic hardware/software design of chip multithreaded systems. I also work on performance and cache modeling for chip multithreaded systems, and on transactional memory. Below is a list of the projects I have worked on in the past and am working on at present.

2005-present: Transactional Memory

With the Scalable Synchronization group at Sun Microsystems Labs, I developed a Berkeley DB benchmark for the Hybrid Transactional Memory System, and evaluated the system using that benchmark. This is one of the first efforts to convert a real application to use transactional memory.

2002-2006: Ph.D. Thesis: Operating System Scheduling for Chip Multiprocessors

Chip multiprocessors, the currently emerging family of CPUs, run multiple application threads in parallel; in contrast with conventional processors, these threads share many of the processor's resources. As a result, a thread's performance depends on the threads with which it is scheduled to run. A thread scheduler decides how to assign threads to processors in order to achieve goals such as optimal performance or fairness, and scheduling algorithms for chip multiprocessors must take the interaction between threads into account in order to work properly. For my thesis I developed three scheduling algorithms for chip multiprocessors: the cache-fair algorithm, the non-work-conserving algorithm, and the target-miss-rate algorithm.

The cache-fair algorithm addresses schedule-dependent performance variability, a phenomenon where an application's CPU performance depends on the threads with which it is co-scheduled. This variability makes it difficult to forecast application performance, which in turn hinders the design of systems that provide quality-of-service guarantees. On chip multiprocessors with single-threaded processing cores, the shared second-level (L2) CPU cache is the source of schedule-dependent performance variability. My algorithm uses an analytical model to estimate the L2 cache miss rate a thread would have if the cache were shared equally among all the threads, the "fair" miss rate. It then adjusts the thread's share of CPU cycles in proportion to its deviation from its fair miss rate, which reduces the effect of schedule-dependent miss-rate variability on the thread's running time. In cases of significant variability, the cache-fair algorithm reduces schedule-dependent performance variability by at least a factor of two, and sometimes by as much as a factor of seven. The algorithm has been implemented in the Solaris operating system; it works without any advance knowledge of the workload and relies only on the hardware performance counters typically available on modern processors.
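The cycle-share adjustment at the heart of the cache-fair algorithm can be sketched as follows. This is an illustrative Python sketch, not the Solaris implementation: the function name, the linear compensation rule, and the gain parameter are my own assumptions.

```python
def cache_fair_adjustment(measured_miss_rate, fair_miss_rate,
                          base_quantum, gain=0.5):
    """Scale a thread's CPU quantum in proportion to its deviation
    from its fair L2 miss rate (hypothetical names and gain).

    A thread missing more often than its fair rate is being hurt by
    its co-runners and receives extra cycles to compensate; a thread
    missing less often than its fair rate gives cycles back.
    """
    if fair_miss_rate == 0:
        return base_quantum
    deviation = (measured_miss_rate - fair_miss_rate) / fair_miss_rate
    return max(1, round(base_quantum * (1 + gain * deviation)))
```

For example, a thread whose measured miss rate is double its fair miss rate would, under this sketch, receive a 50% longer quantum, while a thread running exactly at its fair miss rate is left untouched.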

The non-work-conserving algorithm employs an online analytical model to determine whether it is beneficial to run fewer threads on a shared-cache processor than the maximum possible, for the sake of relieving the pressure on the L2 cache and improving performance. This algorithm correctly identifies such situations in most scenarios, and significantly improves performance.
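The decision the non-work-conserving algorithm makes can be sketched minimally as below. The `predicted_ipc` callback is a hypothetical stand-in for the online analytical model described above.

```python
def best_thread_count(predicted_ipc, max_threads):
    """Pick how many threads to co-schedule on a shared-cache processor.

    predicted_ipc(n) estimates per-thread instructions-per-cycle when
    n threads share the L2 cache (a stand-in for the online analytical
    model).  Running fewer threads than max_threads wins when cache
    contention degrades per-thread IPC faster than additional threads
    add aggregate throughput.
    """
    return max(range(1, max_threads + 1),
               key=lambda n: n * predicted_ipc(n))
```

With a contention-heavy model the chosen count falls below the number of hardware contexts; with a contention-free model all contexts are used, matching work-conserving behavior.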

The target-miss-rate algorithm dynamically identifies the threads that contribute most to the overall L2 cache miss rate by observing the aggregate DRAM traffic, and, by lowering priorities of such threads, achieves a target L2 miss rate.
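The selection step of the target-miss-rate algorithm might look like the following sketch. The per-thread contribution estimates and the assumption that demoting a thread removes its contribution entirely are simplifications of mine; the actual algorithm infers contributions from aggregate DRAM traffic and lowers priorities incrementally.

```python
def threads_to_demote(miss_contrib, aggregate_rate, target_rate):
    """Choose which threads to lower in priority so the projected L2
    miss rate meets the target.

    miss_contrib maps thread id -> estimated contribution to the
    aggregate miss rate (e.g., misses per 1000 instructions).  The
    biggest contributors are demoted first; for this sketch, demotion
    is assumed to remove a thread's contribution from the aggregate.
    """
    demoted = []
    projected = aggregate_rate
    for tid, contrib in sorted(miss_contrib.items(),
                               key=lambda kv: kv[1], reverse=True):
        if projected <= target_rate:
            break
        demoted.append(tid)
        projected -= contrib
    return demoted, projected
```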

2000-2002: Application Performance on the Direct Access File System

In collaboration with Kostas Magoutis and Salimah Addetia

The Direct Access File System (DAFS) is a distributed file system built on top of direct-access transports (DAT). Direct-access transports are characterized by the use of remote direct memory access (RDMA) for data transfer and by user-level networking. The motivation behind a DAT-enabled distributed file system architecture is the reduction of overhead on the I/O data path. The goal of this work was to determine whether the architecture of DAFS brings any fundamental performance benefits to applications compared to traditional distributed file systems. To accomplish this, we conducted a performance evaluation of DAFS and compared it to a traditional network storage system: a version of NFS optimized to reduce I/O overhead. We concluded that DAFS can achieve superior performance for latency-sensitive applications. Bandwidth-sensitive applications do equally well on both systems, unless they are CPU-bound, in which case they perform better on DAFS. We also found that RDMA is a less restrictive mechanism for achieving copy avoidance than that used by the optimized NFS.

Related Papers:

Alexandra Fedorova, Margo Seltzer, Kostas Magoutis, and Salimah Addetia, "Application Performance on the Direct Access File System", in Proceedings of the Workshop on Software and Performance (WOSP'04), Redwood City, CA, January 14-16, 2004.

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, "Making the Most out of Direct Access Network-Attached Storage", in Proceedings of the Second USENIX Conference on File and Storage Technologies (FAST'03), San Francisco, CA, March 31-April 2, 2003.

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, Margo I. Seltzer, Jeffrey S. Chase, Andrew J. Gallatin, Richard Kisley, Rajiv G. Wickremesinghe, and Eran Gabber, "Structure and Performance of the Direct Access File System", in Proceedings of the USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.
