My principal research interests are operating system scheduling for chip multithreaded
processors and synergistic hardware/software design of chip multithreaded
systems. I also work on performance and cache modeling for such systems,
and I have an interest in transactional memory systems.
Below is a list of the projects that I have worked on
in the past and am working on at present.
2005-present: Transactional Memory
With the Scalable Synchronization group at Sun Microsystems Laboratories, I developed
a Berkeley DB benchmark for the Hybrid Transactional Memory System and evaluated
the system using that benchmark. This was one of the first efforts to convert
a real application to use transactional memory.
Thesis: Operating System Scheduling for Chip Multiprocessors.
Chip multiprocessors, an emerging family of CPUs, run multiple application threads in parallel, and, in contrast with conventional processors, the threads share many of the processor's resources. As a result, a thread's performance depends on the threads with which it is scheduled to run. A thread scheduler decides how to assign threads to processors in order to achieve goals such as optimal performance or fairness. Scheduling algorithms for chip multiprocessors must take into account the interaction between threads in order to work properly. For my thesis I have developed three scheduling algorithms for chip multiprocessors: the cache-fair algorithm, the non-work-conserving algorithm, and the target-miss-rate algorithm.
The cache-fair algorithm addresses schedule-dependent performance variability, a phenomenon where an application's CPU performance depends on the threads with which it is scheduled. Schedule-dependent variability makes it difficult to forecast application performance, hindering the design of systems that provide quality-of-service guarantees. The shared second-level (L2) CPU cache is the source of schedule-dependent performance variability on chip multiprocessors with single-threaded processing cores. My algorithm uses an analytical model to estimate the L2 cache miss rate a thread would have if the cache were shared equally among all the threads, the fair miss rate. It then adjusts the thread's share of CPU cycles in proportion to its deviation from its fair miss rate. This reduces the effect of the schedule-dependent miss rate variability on the thread's runtime. In cases of significant variability the cache-fair algorithm reduces the schedule-dependent performance variability by at least a factor of two and sometimes by as much as a factor of seven. This algorithm has been implemented in the Solaris operating system, works without any advance knowledge about the workload, and relies only on hardware performance counters typically available on modern processors.
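The compensation step described above can be sketched roughly as follows. This is a minimal illustration only: the function name and the `gain` damping parameter are my own inventions, and the actual algorithm derives the fair miss rate from an analytical model fed by hardware performance counters.

```python
def cache_fair_adjustment(base_quantum, actual_miss_rate, fair_miss_rate, gain=0.5):
    """Adjust a thread's CPU quantum in proportion to how far its observed
    L2 miss rate deviates from its estimated fair miss rate.

    A thread missing more often than its fair rate lost time to cache
    contention, so it receives a longer quantum; a thread missing less
    often than its fair rate benefited at others' expense, so its quantum
    shrinks. (Illustrative sketch; 'gain' is a made-up damping parameter.)
    """
    deviation = (actual_miss_rate - fair_miss_rate) / fair_miss_rate
    return max(0.0, base_quantum * (1.0 + gain * deviation))
```

A thread observed at twice its fair miss rate would see its quantum grow by half, while a thread at half its fair rate would give up a quarter of its quantum.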
The non-work-conserving algorithm employs an online analytical model to determine whether it is beneficial to run fewer threads on a shared-cache processor than the maximum possible, for the sake of relieving the pressure on the L2 cache and improving performance. This algorithm correctly identifies such situations in most scenarios, and significantly improves performance.
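The core decision can be illustrated with a toy performance model. The formula and constants below are entirely hypothetical; the real algorithm's online analytical model works from measured miss rates, not this expression. The sketch only shows the shape of the decision: predict aggregate throughput for each candidate thread count and pick the best, even if that leaves hardware contexts idle.

```python
def predicted_ipc(cache_share, cpi_core=1.0, mpi_full=0.001, miss_latency=300.0):
    """Toy model (illustrative numbers): misses per instruction grow
    quadratically as a thread's share of the L2 cache shrinks."""
    mpi = mpi_full / (cache_share ** 2)
    return 1.0 / (cpi_core + mpi * miss_latency)

def best_thread_count(max_threads):
    """Choose the number of co-scheduled threads (up to the maximum the
    processor supports) that maximizes predicted aggregate IPC. A result
    below max_threads means the scheduler deliberately leaves contexts
    idle, i.e., behaves non-work-conservingly."""
    candidates = range(1, max_threads + 1)
    return max(candidates, key=lambda n: n * predicted_ipc(1.0 / n))
```

Under these toy constants, a four-context processor is predicted to run fastest with only two threads, so the scheduler would keep two contexts idle.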
The target-miss-rate algorithm dynamically identifies the threads that contribute most to the overall L2 cache miss rate by observing the aggregate DRAM traffic and, by lowering the priorities of those threads, achieves a target L2 miss rate.
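One interval of this feedback loop might be sketched as follows. This is a toy sketch with an invented per-thread table: here each thread's miss count is given directly for simplicity, whereas the real algorithm infers the dominant contributors from aggregate DRAM-traffic observations.

```python
def aggregate_miss_rate(threads):
    """Aggregate L2 miss rate across all threads, as would be inferred
    from total DRAM traffic (illustrative bookkeeping)."""
    misses = sum(t["misses"] for t in threads.values())
    accesses = sum(t["accesses"] for t in threads.values())
    return misses / accesses

def throttle_step(threads, target_miss_rate):
    """One scheduling interval: if the aggregate miss rate exceeds the
    target, lower the priority of the heaviest miss generator and return
    its name; otherwise leave priorities alone and return None."""
    if aggregate_miss_rate(threads) <= target_miss_rate:
        return None
    worst = max(threads, key=lambda name: threads[name]["misses"])
    threads[worst]["priority"] -= 1
    return worst
```

Repeating this step each interval drives the aggregate miss rate toward the target by progressively deprioritizing the threads that generate the most cache misses.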
Application Performance on the Direct Access File System
Collaboration with Kostas Magoutis and Salimah Addetia
The Direct Access File System (DAFS) is a distributed file system built on top of direct-access
transports (DAT). Direct-access transports are characterized by
using remote direct memory access (RDMA) for data transfer and user-level
networking. The motivation behind the DAT-enabled distributed file
system architecture is the reduction of the overhead on the I/O
data path. The goal of this work is to determine whether the architecture
of DAFS brings any fundamental performance benefits to applications
compared to traditional distributed file systems. To accomplish
this, we conducted a performance evaluation of DAFS and compared
it to a traditional network storage system: a version of NFS optimized
to reduce the I/O overhead. We concluded that DAFS can accomplish
superior performance for latency-sensitive applications. Bandwidth-sensitive
applications do equally well on both systems, unless they are CPU-bound,
in which case they perform better on DAFS. We also found that RDMA
is a less restrictive mechanism to achieve copy avoidance than that
used by the optimized NFS.
Alexandra Fedorova, Margo Seltzer, Kostas Magoutis,
and Salimah Addetia, "Application Performance on the Direct
Access File System",
In Proceedings of the Workshop on Software and Performance 2004 (WOSP'04),
January 14-16, 2004, Redwood City, CA.
Kostas Magoutis, Salimah Addetia, Alexandra Fedorova,
Margo I. Seltzer, "Making the Most out of Direct Access
Network-Attached Storage", In Proceedings
of the Second USENIX Conference on File and Storage Technologies (FAST'03),
San Francisco, CA, March 31-April 2, 2003.
Kostas Magoutis, Salimah Addetia, Alexandra Fedorova,
Margo I. Seltzer, Jeffrey S. Chase, Andrew J. Gallatin, Richard
Kisley, Rajiv G. Wickremesinghe, Eran Gabber, "Structure
and Performance of the Direct Access File System", In
Proceedings of the USENIX Annual Technical Conference, Monterey,
CA, June 9-14, 2002.