The ASC project is a research collaboration between Harvard University and Boston University to create a powerful, practical automatic parallelization runtime.
Traditional automatic parallelization techniques are not keeping pace with the widespread growth in parallelism in computing. ASC is a research project that studies an alternative approach to autoparallelism.
Rather than compiling sequential programs into parallel programs, our approach arises from the fact that executing a sequential von Neumann program on input data traces out a unique path through the state space of execution. If we could somehow partition that complete path into multiple smaller subpaths, then we could take advantage of all available processors to execute each subpath in parallel.
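The path-partitioning idea can be illustrated with a toy sketch. This is not the ASC runtime: the `step` transition function, the `predict` callback, and the fall-back-on-misprediction policy are all assumptions made for illustration. Each subpath starts from a guessed state, and a guess is accepted only if it matches the state actually reached at the end of the previous subpath. (CPython threads will not speed up CPU-bound work; the executor only shows the structure.)

```python
from concurrent.futures import ThreadPoolExecutor


def step(state: int) -> int:
    """One deterministic transition of a toy sequential machine (hypothetical)."""
    return (state * 1103515245 + 12345) % (2 ** 31)


def run(state: int, n_steps: int) -> int:
    """Follow the execution path for n_steps transitions starting at state."""
    for _ in range(n_steps):
        state = step(state)
    return state


def run_partitioned(start, total_steps, n_parts, predict):
    """Split the path into n_parts subpaths and execute them concurrently.

    predict(k) is an assumed callback that guesses the state after k steps.
    A wrong guess is detected during stitching and that subpath is re-run
    sequentially from the true state, so the result is always correct.
    """
    chunk = total_steps // n_parts
    lengths = [chunk] * (n_parts - 1) + [total_steps - chunk * (n_parts - 1)]
    guesses = [start] + [predict(i * chunk) for i in range(1, n_parts)]

    # Speculatively execute every subpath from its guessed start state.
    with ThreadPoolExecutor() as pool:
        ends = list(pool.map(run, guesses, lengths))

    # Stitch: keep a speculative result only if its start state was right.
    actual = ends[0]
    for i in range(1, n_parts):
        if guesses[i] == actual:
            actual = ends[i]
        else:
            actual = run(actual, lengths[i])
    return actual
```

In this sketch all of the difficulty is hidden inside `predict`: the speculation pays off only when the start states of later subpaths can be guessed cheaply and accurately.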
An interdisciplinary team of computer scientists and ecologists has come together to develop tools to facilitate the capture, management, and query of data provenance -- the history of how a digital artifact came to be in its present state. Such data provenance improves the transparency, reliability, and reproducibility of scientific results. Most existing provenance systems require users to learn specialized tools and jargon and are unable to integrate provenance from different sources; these are serious obstacles to adoption by domain scientists. This project includes the design, development, deployment, and evaluation of an end-to-end system (eeProv) that encompasses the range of activity from original data analysis by domain scientists to management and analysis of the resulting provenance in a common framework with common tools. This project leverages and integrates development efforts on (1) an emerging system for generating provenance from a computing environment that scientists actually use (the R statistical language) with (2) an emerging system that utilizes a library of language and database adapters to store and manage provenance from virtually any source.
This project creates a provenance-enabled data citation system that can both be embedded in an existing data platform (specifically, DataVerse) and function as a standalone service. The system will directly include executable transformations for a limited but important set of tools: R and SQL. For other tools, it provides a standardized documentation capability to describe transformations. The system is sufficiently flexible to serve either as part of a publication workflow, where data is part of a more conventional publication, or in support of a standalone publication. It also provides data summaries.
Evolving Operating Systems
Large organizations invest enormous sums of money in software development, which results in large, complex systems that are difficult to maintain. Over time, these systems become increasingly brittle and more terrifying to modify, because no one understands the full system. When these systems are mission critical, they frequently continue to run for decades on outdated hardware, because no one knows how to migrate them to more modern systems. The goal of this project is to develop tools and techniques to allow software to evolve, enabling migration to newer hardware platforms. In particular, we are investigating the use of hardware machine description languages as the foundation for developing tools that will let us synthesize systems for new hardware.
The most significant performance and energy bottlenecks in a computer are often caused by the storage system, because the gap between storage device and CPU speeds is greater than in any other part of the machine. Big data and new storage media only make things worse, because today's systems are still optimized for legacy workloads and hard disks. The team at Stony Brook University, Harvard University, and Harvey Mudd College has shown that large systems are poorly optimized, resulting in waste that increases computing costs, slows scientific progress, and jeopardizes the nation's energy independence.
At Harvard, we are currently investigating two areas: building storage systems for new SMR disk drives and developing systems to store, manipulate, and query large graphs.
Provenance Aware Storage Systems (PASS):
Provenance (also known as pedigree or lineage) refers to the complete
history of a document. In the scientific community, provenance
refers to the information that describes data in sufficient detail
to facilitate reproduction and enable validation of results. In the
archival community, provenance refers to the chain of ownership and
the transformations a document has undergone. However, in most
computer systems today, provenance is an afterthought, implemented
as an auxiliary indexing structure parallel to the actual data.
Provenance, however, is merely a particular type of metadata. The operating system should be responsible for the collection of provenance and the storage system should be responsible for its management. We define a new class of storage system, called a provenance-aware storage system (PASS), that supports the automatic collection and maintenance of provenance. A PASS collects provenance as new objects are created in the system and maintains that provenance just as it maintains conventional file system metadata. In addition to collecting and maintaining provenance, a PASS also supports queries upon the provenance.
Currently, we have implemented a prototype that records relevant system activity, stores it persistently in an in-kernel database, and responds to user queries about a file's provenance.
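To make the collect-then-query behavior concrete, here is a minimal user-level sketch. It is not the PASS prototype, which works inside the kernel: the `ProvenanceStore` class, its method names, and the file names in the usage lines are hypothetical. The store records which process produced each object from which inputs, and answers an ancestry query by walking those records.

```python
class ProvenanceStore:
    """Toy in-memory provenance store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps each object name to (process, inputs) describing its creation.
        self._records = {}

    def record(self, output, process, inputs):
        """Record that `process` created `output` from `inputs`."""
        self._records[output] = (process, tuple(inputs))

    def created_by(self, obj):
        """Return the process that created obj, or None if nothing is recorded."""
        record = self._records.get(obj)
        return record[0] if record else None

    def ancestors(self, obj):
        """Return every object that contributed, directly or not, to obj."""
        seen = set()
        stack = [obj]
        while stack:
            current = stack.pop()
            _, inputs = self._records.get(current, (None, ()))
            for parent in inputs:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


store = ProvenanceStore()
store.record("clean.csv", "clean.py", ["raw.csv"])
store.record("plot.png", "plot.R", ["clean.csv", "style.cfg"])
lineage = store.ancestors("plot.png")
```

Here `lineage` contains `raw.csv`, `clean.csv`, and `style.cfg`. A real system would capture these records automatically as objects are created, rather than through explicit `record` calls.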
Scalable Data Management
This project addresses the challenge of making petabyte scale storage systems easily usable. The techniques we developed forty years ago are simply not up to the task of managing billions of files. We leverage provenance (a record of the data and processes that contributed to its creation), content, and other attributes to provide a scalable and searchable file namespace that tracks data as it moves through the scientific workflow.
Hourglass:
The Hourglass project is building a scalable, robust data collection system to support geographically diverse sensor network applications. Hourglass is an Internet-based infrastructure for connecting a wide range of sensors, services, and applications in a robust fashion. In Hourglass, streams of data elements are routed to one or more applications. These data elements are generated from sensors inside of sensor networks whose internals can be entirely hidden from participants in the Hourglass system. The Hourglass infrastructure consists of an overlay network of well-connected dedicated machines that provides service registration, discovery, and routing of data streams from sensors to client applications. In addition, Hourglass supports a set of in-network services such as filtering, aggregation, compression, and buffering stream data between source and destination. Hourglass also allows third party services to be deployed and used in the network.
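The routing of stream elements through an in-network service can be sketched in a few lines. This is a single-process toy, not the Hourglass overlay or its API: the `Overlay` class, its `subscribe` and `publish` methods, and the stream name are assumptions. It shows one stream delivered to two applications, one of which receives it through a filtering service.

```python
class Overlay:
    """Toy stand-in for a stream-routing overlay (hypothetical API)."""

    def __init__(self):
        # Maps a stream name to a list of (service, deliver) routes.
        self._routes = {}

    def subscribe(self, stream, deliver, service=None):
        """Route `stream` to `deliver`, optionally through an in-network service.

        A service takes one data element and returns a (possibly transformed)
        element, or None to drop it, as a filter would.
        """
        self._routes.setdefault(stream, []).append((service, deliver))

    def publish(self, stream, element):
        """Send one data element from a sensor to every subscribed application."""
        for service, deliver in self._routes.get(stream, []):
            result = element if service is None else service(element)
            if result is not None:
                deliver(result)


def above_30(reading):
    """Filtering service: pass readings above 30 and drop the rest."""
    return reading if reading > 30 else None


overlay = Overlay()
all_readings, hot_readings = [], []
overlay.subscribe("temperature", all_readings.append)
overlay.subscribe("temperature", hot_readings.append, service=above_30)
for value in [21, 35, 28, 41]:
    overlay.publish("temperature", value)
```

After the loop, `all_readings` holds every element while `hot_readings` holds only 35 and 41. Neither application knows anything about the sensor that produced the data, which mirrors the hiding of sensor network internals described above.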
Copyright © 2009 Margo I. Seltzer. All rights reserved.