The conventional methodology for system performance measurement, which relies primarily on throughput-sensitive benchmarks and throughput metrics, has major limitations when analyzing the behavior and performance of interactive workloads. The increasingly interactive character of personal computing demands new ways of measuring and analyzing system performance. In this paper, we present a combination of measurement techniques and benchmark methodologies that address these problems. We introduce several simple methods for making direct and precise measurements of event handling latency in the context of a realistic interactive application. We analyze how results from such measurements can be used to understand the detailed behavior of latency-critical events. We demonstrate our techniques in an analysis of the performance of two releases of Windows NT and Windows 95. Our experience indicates that latency can be measured for a class of interactive workloads, providing a substantial improvement in the accuracy and detail of performance information over measurements based strictly on throughput.
In this paper, we propose a set of new techniques for performance measurement in which latency is measured in the context of a workload that is realistic, both in terms of the application used and the rate at which user-initiated events are generated. We present low-overhead methods that require minimal modifications to the system for measuring latency for a broad class of interactive events. We use a collection of simple benchmark examples to characterize our measurement methodology. Finally, we demonstrate the utility of our metrics by applying them in a comparison of Microsoft Windows NT versions 3.51 and 4.0 and Windows 95, using realistic interactive input to off-the-shelf applications.
The remainder of this section provides background on the problem of measuring latency, including the motivation for our new methodology based on an analysis of the current practice in performance measurement. Section 2 describes our methodology in detail. In Section 3, we discuss some of the issues in evaluating response time in terms of a user's experience. In Sections 4 and 5, we apply our methodology in a comparison of Windows NT versions 3.51 and 4.0 and Windows 95. Sections 6 and 7 discuss the limitations of our work and conclude.
The insufficient detail provided by throughput benchmarks can also mislead designers trying to identify the bottlenecks of a system. Since throughput benchmarks provide only end-to-end measures of activity, system activity generated by low-latency events cannot be distinguished from that generated by longer-latency events, which have a much greater impact on user-perceived performance. Worse, if such a benchmark includes sufficiently many short-latency events, these short events can contribute significantly to elapsed time, leading designers to optimize parts of the system that have little or no impact on user-perceived performance. In an effort to compare favorably against other systems in throughput benchmarks, designers may even undertake such optimizations knowingly. In this case, bad benchmarking methodology hurts both system designers and end users.
In addition, user interfaces tend to use features such as blinking cursors and interactive spelling checkers that have (or are intended to have) negligible impact on perceived interactive performance, yet may be responsible for a significant amount of the computation in the overall activity of an application. Throughput measures provide no way to distinguish between these features and events that are less frequent but have a significant impact on user-perceived performance.
When a benchmark uses an uninterrupted stream of requests, the system batches requests more aggressively to improve throughput. Measurement results obtained while the system is operating in this mode are meaningless; users will never be able to generate such an input stream and achieve a similar level of batching in actual use. Disabling batching altogether is sometimes possible but does not fully address the problem. An ideal test input should permit a level of batching that is likely to occur in response to real user input.
Overall, throughput measures provide an indirect rather than a direct measure of latency, and as such they can give a distorted view of interactive performance. An ideal benchmarking methodology will drive the system in the same way that real users do and give designers a correct indication as to which parts of the system are responsible for delays or user-perceptible latency. Obtaining such figures requires an input stream that closely resembles one that an interactive user might generate and, more importantly, the ability to measure the latency of individual events.
Batch benchmarks address the shortcomings of microbenchmarks by measuring the time to run a complete, non-trivial computation from start to finish. Batch benchmarks are the dominant way of measuring system performance [1,13,14]. Portable benchmarks such as the SPEC95 suite are based on batch computations with non-interactive input. The portability requirement means that it must be possible to adapt these benchmarks to use new system APIs or to accommodate the idiosyncrasies of a given API implementation. To meet these requirements, the SPEC benchmarks are distributed in source form. The central problem with benchmarks such as the SPEC suites is that, although they may represent a realistic batch load, they fail to model the behavior of an interactive user [8].
Benchmarks such as Winstone, BAPCo SYSmark NT and BAPCo SYSmark 32 sacrifice portability in order to use popular interactive applications. Winstone is specific to PC-compatible hardware running Windows 95 or Windows NT. SYSmark NT provides somewhat more hardware portability than the Winstone suite, providing native executables for non-x86 systems that run Windows NT. Although these benchmark suites use interactive applications and simulated interactive input, they report performance in terms of throughput metrics and suffer from the problems presented in Section 1.1. Non-trivial batch computations are sometimes included in these workloads when they are consistent with realistic usage of the application. Examples are circuit board layout in MAXEDA (BAPCo SYSmark NT) and database queries in Microsoft Access (Winstone 95). Although these benchmarks use realistic applications, the input streams used to drive them model an infinitely fast user.
BAPCo SYSmark for File Servers, Ziff-Davis Netbench, SPEC SFS (LADDIS), and the TPC benchmarks measure performance of a server under client load. Both Netbench and SPEC SFS measure file server performance, and compute their results using the latency observed by clients. The intent of TPC-C is to measure on-line transaction processing performance. It reports performance in terms of complete transactions processed per minute. TPC-D is designed to measure decision-support database performance. TPC-D specifies one metric (QppD) that measures "raw performance" or latency, and another metric (QthD) for throughput. Several of these benchmarks use client-observed latency in reporting performance, but only for client requests to a remote server. Our goal is to measure latency for a more general class of events.
A prior study compared Windows NT, Windows for Workgroups 3.11, and NetBSD to explore how differences in system structure affect overall performance [4]. The study used microbenchmarks to measure the latency of simple events and application workloads. Although some of the application workloads were based on interactive applications, the applications were driven by an uninterrupted input stream, and the results were reported in terms of throughput. There was no attempt to measure interactive performance directly. Because of the requirement that the workloads run on a Unix system, the study included no popular PC applications.
Two prior studies influenced our work. Shand used a free-running counter in a tight loop to measure the latency of interrupts [11]. This methodology is similar to ours in that it measures the computation of interest by detecting lost time. We generalize the prior work by looking beyond interrupt handling to a broader range of event types, by avoiding the need for special-purpose hardware, and by looking at interactive events rather than the latency of interrupt handlers. Second, our techniques for visualizing latency were influenced by the work of O'Toole et al. [9] on reducing the pause times for garbage collection.
In the following sections, we describe the combination of tools and techniques that we use to measure and identify event latency.
The lack of kernel source code prevents us from instrumenting the kernel to identify the exact times at which the processor leaves or enters the idle loop. Instead, we replace the system's idle loop with our own low-priority process in each of the operating systems. These low-priority processes measure the time to complete a fixed computation: N iterations of a busy-wait loop. The instrumentation code logs the time required by the loop. The pseudo code is as follows:
while (space_left_in_the_buffer) {
    for (i = 0; i < N; i++)
        ;                        /* fixed busy-wait computation */
    generate_trace_record;       /* log the time the loop required */
}
We select the value of N such that the inner loop takes one ms to complete when the processor is idle. In this way we generate one trace record per millisecond of idle time. If the processor is taken away from the idle loop, the loop takes longer than one ms of elapsed time to complete. Any non-idle time manifests itself as an elongated time interval between two trace records. The larger we make N, the coarser the accuracy of our measurements; the smaller we make N, the finer the resolution of our measurements but the larger the trace buffer required for a given benchmark run.
We wrote and measured a simple microbenchmark to demonstrate and validate this methodology. It uses a program that waits for input from the user and when the input is received, performs some computation, echoes the character to the screen, and then waits for the next input. We measured the time it took to process a keystroke in two ways. First, we used the idle loop method described above to measure the processing time. Figure 1 shows the times at which the samples were collected.
For the sake of clarity only a few samples are shown. The figure shows that the system spent approximately one ms generating samples A, B, D, and E, indicating that the system was idle during the periods in which these samples were generated, but spent 10.76 ms generating sample C. The difference, (10.76 - 1) or 9.76 ms, represents the time required to handle the event.
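The subtraction described above can be applied to a whole trace. The following is a minimal sketch, assuming the trace is available as a list of record timestamps in milliseconds; the function name `busy_time_per_sample` and the example timestamps are hypothetical and chosen only to reproduce the 10.76 ms sample discussed in the text.

```python
def busy_time_per_sample(timestamps_ms, idle_period_ms=1.0):
    """Return the non-idle time hidden in each interval between trace records.

    Each interval is expected to last idle_period_ms when the processor is
    idle; anything beyond that is attributed to other (non-idle) work.
    """
    busy = []
    for previous, current in zip(timestamps_ms, timestamps_ms[1:]):
        elapsed = current - previous
        # Clamp at zero so timer jitter never yields a negative busy time.
        busy.append(max(0.0, elapsed - idle_period_ms))
    return busy


# Hypothetical timestamps: a starting record followed by samples A..E,
# where sample C took 10.76 ms instead of the idle 1 ms.
trace = [0.0, 1.0, 2.0, 12.76, 13.76, 14.76]
busy = busy_time_per_sample(trace)
print([round(b, 2) for b in busy])
```

Only the interval for sample C reports a non-zero busy time, which corresponds to the event-handling latency.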
Next, we used the traditional approach, recording one timestamp when the program received the character (i.e., after a call to getchar()) and a second timestamp after the character was echoed back to the screen. This measurement reported an event-handling latency of only 7.42 ms. The 2.34 ms discrepancy between the two measurements highlights a shortcoming of the conventional measurement methodology. Our test program calls the getchar() function to wait for user input. When the user enters a character, the system generates a hardware interrupt, which is first handled by the dynamically linked library KERNEL32.DLL. In the traditional approach, the measurement does not start until control is returned to the test program. Therefore, it fails to capture the system time required to process the interrupt and reschedule the benchmark thread. In comparison, our idle loop methodology provides a more complete measurement of the computation required to process the keyboard event.
Our idle loop methodology uses CPU busy time to represent event latency, but there are several issues that prevent this from being an accurate measure of the user's perceived response time. One problem is that most graphics output devices refresh every 12-17 ms. In this research, we do not consider this effect.
Another problem is that CPU busy time and CPU idle time do not equate directly with wait time and think time. First, synchronous I/O requests contribute to wait time, even though the CPU can be idle during these operations. Second, in the case of background processing, the user may not be waiting even though the CPU is busy. The first problem could be solved with system support for monitoring the I/O queue and distinguishing between synchronous and asynchronous requests. In order to address the second problem, we must consider how events are processed by the systems. When the user generates key strokes and mouse clicks, they are queued in a message queue to await processing. Therefore, when there are events queued, we can assume that the user is waiting. By combining CPU status (busy or idle), message queue status (empty or non-empty), and status for outstanding synchronous I/O (busy or idle), we can speculate during which time intervals the user is waiting.
Figure 2 shows a state transition diagram for identifying think time and wait time in our system, using the parameters: CPU state, message queue state, and synchronous I/O status. The diagram omits asynchronous I/O, which we assume is background activity, and assumes that users always wait for the completion of an event. In reality, we can never precisely distinguish think time from wait time, because we cannot know what the user is doing and whether the user is actually waiting for an event to complete or is thinking while an event is being processed. For simplicity, in the rest of this paper, we assume that the user waits for each event and report results in terms of event handling latency. In the next section, we describe how we obtain information about the status of the message queue.
Figure 2 illustrates our design for a finite state machine that distinguishes think time from wait time in a latency-measurement system. In Sections 4, 5, and 6, we will demonstrate how to apply complete information about CPU state and partial information about message queue state to implement part of the FSM. Implementation of the full FSM requires additional system support for monitoring I/O and message queue state transitions. Implementation of such monitoring is part of our continuing work at Harvard.
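To make the idea of combining the three status signals concrete, here is a small sketch of a two-state classifier. It is not a transcription of Figure 2: the transition rules are assumptions derived from the prose (queued input or outstanding synchronous I/O implies waiting, CPU activity with nothing queued is treated as background work unless an event is already in progress), and the names `UserState`, `next_state`, and `classify` are hypothetical.

```python
from enum import Enum


class UserState(Enum):
    THINK = "think"
    WAIT = "wait"


def next_state(state, cpu_busy, queue_nonempty, sync_io_busy):
    """Advance the think/wait state for one observation of the system."""
    # Assumption: queued input or outstanding synchronous I/O means the
    # user is waiting, regardless of what the CPU is doing.
    if queue_nonempty or sync_io_busy:
        return UserState.WAIT
    # Assumption: once waiting, the user keeps waiting until the CPU goes
    # idle, i.e. until the event has been completely handled.
    if state is UserState.WAIT and cpu_busy:
        return UserState.WAIT
    # Idle system, or CPU busy with background activity only.
    return UserState.THINK


def classify(samples, initial=UserState.THINK):
    """Label each (cpu_busy, queue_nonempty, sync_io_busy) sample."""
    state = initial
    labels = []
    for cpu_busy, queue_nonempty, sync_io_busy in samples:
        state = next_state(state, cpu_busy, queue_nonempty, sync_io_busy)
        labels.append(state)
    return labels


samples = [
    (False, False, False),  # idle system
    (True, False, False),   # background CPU activity
    (True, True, False),    # user input queued
    (True, False, False),   # event still being processed
    (False, False, True),   # synchronous I/O outstanding, CPU idle
    (False, False, False),  # event complete
]
print([label.value for label in classify(samples)])
```

A stateful classifier is used because the same observation (CPU busy, queue empty) means different things depending on whether an event is already in progress.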
Next, we will present two simple example measurements to give some insight into some of the non-trivial aspects of interpreting the output of our measurements.
Both versions of Windows NT show bursts of CPU activity at 10 ms intervals due to hardware clock interrupts. Correlating the samples with a count of hardware interrupts from the Pentium performance counters shows that each burst of computation is accompanied by a hardware interrupt.
Although we have compensated for the overhead introduced by the user-level idle loop, Windows 95 shows a higher level of activity in comparison to both versions of the NT system. We do not know what causes this increased activity in Windows 95.
By coupling our idle-loop methodology with the Pentium counters, we were able to compute the interrupt handling overhead for various classes of interrupts, measurements that are difficult to obtain using conventional methods. For example, the smallest clock interrupt handling overhead under Windows NT 4.0 was about 400 cycles, or 4 µs.
Figures 4a and 4b show the results of our measurement in two different resolutions.
Figure 4a shows the full 1 ms resolution of the data, while Figure 4b shows the CPU utilization averaged over 10 ms intervals. Both figures clearly show the 80 ms of 100% CPU utilization required to process the input event (from 100 to 180 ms) and another period of 100% CPU utilization starting around 400 ms to redisplay the page. The stair pattern between 180 and 400 ms illustrates the CPU activity required to perform the animation. From Figure 4a, we can observe that the bursts of CPU activity for performing animation are aligned on 10 ms boundaries, suggesting that they are scheduled by clock interrupts. Each step of animation takes progressively longer time to complete as the window outline increases in size.
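The coarser view can be derived from the fine-grained one by simple windowed averaging. The sketch below assumes the utilization data has already been resampled into fixed 1 ms buckets holding a fraction between 0 and 1; the function name `average_utilization` and the sample data are hypothetical.

```python
def average_utilization(samples, window=10):
    """Average per-bucket CPU utilization over consecutive windows.

    samples: utilization per fixed-size bucket (e.g. one value per ms).
    window:  number of buckets to combine (10 gives 10 ms averages).
    A final partial window is averaged over the buckets it contains.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averaged = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Hypothetical 30 ms of data: 10 ms idle, 15 ms fully busy, 5 ms idle.
per_ms = [0.0] * 10 + [1.0] * 15 + [0.0] * 5
print(average_utilization(per_ms))
```

Averaging hides the alignment of short bursts, which is why the 10 ms boundary effect is only visible at the finer resolution.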
This measurement shows that a single user event can correspond to multiple intervals of CPU busy time. Such events complicate the task of precisely identifying event boundaries. Monitoring the Message API (Section 2.4) is one of the techniques that helps us pinpoint the beginning and ending of interactive events.
We used Microsoft Visual Test to create most of our microbenchmarks and task-oriented benchmarks. MS Test provides a system for simulating user input events on a Windows system in a repeatable manner. Test scripts can specify the pauses between input events, generating minimal runtime overhead. However, in some cases, the way that Test drives applications alters the behavior of those applications. This effect is discussed in detail in Section 5.4.
However, we also believe that the threshold, T, is a function of the type of event. For example, users probably expect keystroke event latency to be imperceptible while they may expect that a print command will impose some delay. The issues of event types, user expectation, the precise tolerance of users for delay, and the limitations of human perception are beyond our field of expertise. Presented with these obstacles, we modified our plans, and present latency measurements graphically. We trust that the issues in human-computer interaction can be resolved by specialists. In the meantime, our visualization of latency enables us to compare applications and develop an intuition for responsiveness without risking the inappropriate data reductions that could occur given our limited background in experimental psychology.
Figure 5 is an example of the graphical representation of our raw data. Each vertical bar represents an event that began at the time represented by the X value and lasted for a period represented on the Y axis. Figure 5a shows the data for an entire Microsoft Word benchmark run, while 5b shows a magnification of a two second interval. The complete event latency profile provides a very coarse view of the application, while the magnification provides the detail to explain the periodicity in the overall pattern. By drawing a horizontal line at a given "irritation threshold," the frequency and distribution of irritating events is readily visible.
For our task-oriented benchmarks, we use three graphical representations to capture the responsiveness of an application. First, we present histograms, showing the number of events corresponding to each measured latency. This presents a detailed breakdown of the event latencies and provides some intuition into the different categories of events present in an application. Next, we integrate over the histogram presenting a cumulative latency graph. This provides the quantitative data indicating how events of a particular duration contribute to the overall time required to complete a task. Finally, we plot the cumulative latency as a function of the number of events, providing an intuition about the variance in response time perceived by the user. Note that in each of these cases, the events are sorted by their duration, not by their actual time of occurrence.
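The data behind these three representations can be derived from a plain list of event latencies. The following is a sketch under the assumption that latencies are given in whole milliseconds; the function names and the sample values are hypothetical, and plotting is left out.

```python
from collections import Counter


def latency_histogram(latencies_ms, bin_ms=1):
    """Count events per latency bin, keyed by the lower edge of the bin."""
    return Counter(int(latency // bin_ms) * bin_ms for latency in latencies_ms)


def cumulative_latency_by_duration(latencies_ms):
    """Return (latency, total time spent in events of that latency or less)."""
    curve = []
    total = 0
    for latency in sorted(latencies_ms):
        total += latency
        if curve and curve[-1][0] == latency:
            curve[-1] = (latency, total)  # same duration: update the point
        else:
            curve.append((latency, total))
    return curve


def cumulative_latency_by_event_count(latencies_ms):
    """Return (number of events, running total latency), sorted by duration."""
    curve = []
    total = 0
    for count, latency in enumerate(sorted(latencies_ms), start=1):
        total += latency
        curve.append((count, total))
    return curve


# Hypothetical event latencies in ms, in order of occurrence.
latencies = [5, 5, 8, 30, 5, 30]
print(dict(latency_histogram(latencies)))
print(cumulative_latency_by_duration(latencies))
print(cumulative_latency_by_event_count(latencies))
```

All three functions sort by duration rather than by time of occurrence, matching the note at the end of the paragraph above.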
Figure 6 shows the latencies for two simple interactive events, unbound keystroke and mouse click on the screen background, under the three operating systems. We were unable to measure the overhead of Microsoft Test for these microbenchmarks, so we were forced to use manual input. To compensate for the potential variability introduced by a human user, we report the mean of 30-40 trials, ignoring cold cache cases. The most significant standard deviations occurred in the key click events for Windows NT 4.0 and Windows 95 (8%) while all the remaining standard deviations were under 2% of the mean.
On the keystroke test, Windows 95 shows substantially worse performance than NT 4.0. This is a reflection of segment register loads (not shown) and other overhead associated with 16-bit Windows code [4], which persist in Windows 95.
The mouse click results are even more striking. The Windows 95 measurements are off the scale, because the system busy-waits between "mouse down" and "mouse up" events; therefore our measurement indicates the length of time the user took to perform the mouse click. This is much longer than the actual processing times of the NT systems and is not indicative of the actual Windows 95 performance.
Our methodology provides little guidance in explaining the differences in performance between Windows NT 3.51 and NT 4.0, but it does highlight the fact that instructions and data references occur roughly in proportion to cycles across the systems for both of the simple interactive events. Therefore, we conclude that in the warm cache case, the performance differences are a function of the code path lengths. It is possible that the difference in code path length stems from the change in GUI between NT 3.51 and NT 4.0.
We ran each benchmark five times using Microsoft Test and found that the results were consistent across runs. The standard deviations for the elapsed times and cumulative CPU busy times were 1-2%, and the event latency distributions were virtually identical. The graphical output shown in the following sections depicts one of the five runs for each benchmark.
Notepad is a simple editor for ASCII text distributed with all versions of Microsoft Windows. Our Notepad benchmark models an editing session on a 56KB text file, which includes text entry of 1300 characters at approximately 100 words per minute, as well as cursor and page movement. With this benchmark, we demonstrate how differences in average response time across the three systems manifest themselves in our visual representation of latency and how they can be used to compare system performance. We used the same Notepad executable (the Windows 95 version) on all three systems and used a Microsoft Test script to drive Notepad. Since virtually all Notepad activity is synchronous, we were able to collect the latency figures for every keystroke that the user made in a straightforward way. By correlating our idle loop measurement with our monitoring of the PeekMessage() and GetMessage() API calls, we were able to clearly identify the Test overhead and remove it from the data presented in Figure 7.
Notice that the Y scale in the histogram in Figure 7 is a logarithmic scale. The cumulative latency graph shows that for all three systems, over 80% of the latency of Notepad is due to low-latency (less than 10 ms) events. These short-latency events are the keystrokes that generate printable ASCII characters. The remaining 20% of the total latency is due to the longer-latency (at least 28 ms) keystrokes that cause "page down" or newline operations. These keystrokes cause Notepad to refresh all or part of the screen. The smoothness of the curves in the bottommost graph in Figure 7 shows that there is little variance in either the long-latency events or the short-latency events. Events of the same type contribute equally to the total latency.
The latencies measured are relatively small for Notepad and reflect both the simplicity of the application and the relatively fast PC that we used for our experiments. Although these differences in latency are likely to go unnoticed by users of our test system, they might have a significant effect on user-perceived performance on a slower machine.
The shortest events in Figure 8 (with latency of less than one second) are due to "page down" operations and Excel operations. Both systems exhibited a similar latency distribution for these events. Six events had latencies greater than one second on both systems, in nearly the same relative order. Table 1 lists these long latency events.
All of the long-latency events required disk accesses, which are responsible for the majority of the latency for these events. The effects of the file system cache are most clearly observed in the latency for starting the second OLE edit, as more of the pages for the embedded Excel object editor become resident in the buffer cache.
The cumulative latency graph shows that both versions of Windows NT demonstrate similar performance for the short-latency keystrokes, and the majority of the performance difference is a result of the ability of NT 4.0 to handle the long-latency events much more efficiently. We turn to application microbenchmarks to examine this phenomenon in more detail.
The standard deviations are all below 3%.
Figure 9 shows some of the hardware events observed in each system during the page-down operation. The graph shows that NT 4.0 was able to handle the request in the shortest amount of time followed by Windows 95 and NT 3.51. The difference in the latency between the two versions of Windows NT is explained by the differences in system architecture. In NT 3.51, the Win32 API is implemented by a user-level server. The negative performance effects of this server-based architecture were demonstrated in prior research [4]. In NT 4.0, some components of the Win32 API server are rumored to have been moved into the kernel. The improved locality from this change is reflected in reduced TLB misses for NT 4.0 compared to NT 3.51. A lower TLB miss rate implies fewer protection domain crossings in Pentium processors, which flush the TLB on each crossing [5]. Using 20 cycles per miss as a lower bound on TLB miss handling latency, the extra TLB misses that occur for NT 3.51 (both instruction and data) account for at least 25% of the latency difference between NT 3.51 and NT 4.0.
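The lower-bound argument is a short piece of arithmetic: multiply the extra TLB misses by the minimum cost per miss and divide by the latency difference expressed in cycles. The sketch below illustrates the calculation only. The function name is hypothetical, and the miss count and cycle difference are invented placeholder values, not the measurements behind Figure 9.

```python
def tlb_share_lower_bound(extra_tlb_misses, latency_diff_cycles, cycles_per_miss=20):
    """Lower bound on the fraction of a latency difference due to TLB misses.

    extra_tlb_misses:    additional misses (instruction plus data) on the
                         slower system.
    latency_diff_cycles: latency difference between the systems, in cycles.
    cycles_per_miss:     minimum cost of handling one miss (20 in the text).
    """
    if latency_diff_cycles <= 0:
        raise ValueError("latency difference must be positive")
    return extra_tlb_misses * cycles_per_miss / latency_diff_cycles


# Placeholder inputs chosen for illustration only.
share = tlb_share_lower_bound(extra_tlb_misses=50_000,
                              latency_diff_cycles=4_000_000)
print(f"at least {share:.0%} of the latency difference")
```

Because 20 cycles is a minimum per-miss cost, the result can only understate the true contribution of TLB misses.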
Comparing NT 4.0 and Windows 95, there are a relatively large number of unaligned data accesses and segment register loads for Windows 95 (Figure 9). The high counts for these events are due to the large components of Windows 95 (such as the graphics API) that are implemented in 16-bit code. Windows 95 also incurs 93% more TLB misses than NT 4.0, although we do not have sufficient information to attribute this behavior to a specific architectural feature in the two systems.
Figure 10 shows some of the hardware event counts during the OLE edit start-up. As in the page down benchmark, NT 4.0 completes the operation with the shortest latency followed by Windows 95 and NT 3.51. Comparing NT 4.0 to NT 3.51, the observations made for the page down benchmark carry through to the OLE edit. Overhead from elevated TLB miss rates accounts for at least 23% of the latency difference between NT 3.51 and NT 4.0. In Windows 95, we observed a large number of segment register loads and unaligned data accesses, both of which can be attributed to code executing in 16-bit mode.
Figure 11 shows results for Microsoft Test driven simulations on the two versions of Windows NT. Compared to Notepad, Word requires substantially more processing time per keystroke, due to additional functionality such as text formatting, variable-width fonts and interactive spell checking. For the majority of interactive events, NT 4.0 exhibits shorter response time and lower variance than NT 3.51.
The Microsoft Word benchmark demonstrates both the strengths and limitations of evaluating interactive performance using latency. Compared to throughput measurements, our latency analysis provides much more detailed information, such as variations in latency and the distribution of events with different latencies. However, the structural features of Word push us to the limit of the behavior we are able to analyze. Our analysis indicates that Word uses a single system thread, but responds to input events and handles background computations asynchronously using an internal system of coroutines or user level threads.
Distinguishing background activity from foreground activity in Word is challenging. We examined the results of hand-generated Word input under Windows NT 3.51, compared them to the Test-generated results, and found significant differences. For our hand-generated tests, we ran seven trials, with the same typist and input, and found that the event histograms appeared very similar and that the variation in cumulative latency and elapsed time was less than 4% across the runs. While the Test results showed that most events had latency between 80 and 100 ms, we measured a 32 ms typical latency for the hand-generated input. This difference in event latency was accompanied by a compensating difference in background activity. The hand-generated input showed a higher level of background activity than the Test-generated results. We also observed that carriage returns under the hand-generated input took longer than 200 ms to handle while the longest latency events we saw in the Test-generated runs were 140 ms. Our Message API log reveals that Test generates a WM_QUEUESYNC message after every keystroke. We hypothesize that these messages were responsible for the different behavior under Test and under manual typing. However, with our current tools, the complexity of Word makes it difficult to thoroughly analyze even the simple experiment we present here.
Even in the presence of rich APIs, distinguishing between wait time and think time is not always possible. There is no automatic way to detect exactly what a user is doing. Without user input, we can never tell whether a user is genuinely waiting while the system paints a complicated graphic on the screen or is busy thinking. For simulations using designed scripts, we can make assumptions about when users think and then analyze performance based on those assumptions, but the most useful analysis will come from evaluating actual user interaction.
One factor that contributes to user dissatisfaction is the frequency of long-latency events. We processed the Microsoft Word profile of Figure 5 to analyze the distribution of interarrival times of events above a given threshold. Since most events in the Word benchmark were very short, we chose thresholds around 100 ms. Table 2 shows the summaries for these thresholds. Note that the standard deviations are of the same order of magnitude as the averages themselves, indicating that there is no strong periodicity between long-latency events.
+-----------+------------------+-----------------------------+
| Threshold | Number of events | Interarrival times (in sec) |
| (in msec) | above threshold  +-------------+---------------+
|           |                  |   Average   |   Std. Dev.   |
+===========+==================+=============+===============+
|    100    |       101        |     3.1     |      3.1      |
+-----------+------------------+-------------+---------------+
|    110    |        26        |    12.4     |     10.6      |
+-----------+------------------+-------------+---------------+
|    120    |         8        |    41.1     |     48.8      |
+-----------+------------------+-------------+---------------+
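A summary of this kind can be computed from an event latency profile with a few lines of code. The sketch below assumes the profile is a list of (start time in seconds, latency in milliseconds) pairs and uses the sample standard deviation; the text does not say which estimator was used, so that choice, the function name `interarrival_summary`, and the example profile are all assumptions.

```python
from statistics import mean, stdev


def interarrival_summary(events, threshold_ms):
    """Summarize the gaps between events whose latency exceeds a threshold.

    events: iterable of (start_time_sec, latency_ms) pairs.
    Returns (number of events above threshold, mean gap, std. dev. of gaps);
    the gap statistics are None when fewer than two events qualify.
    """
    starts = sorted(start for start, latency in events if latency > threshold_ms)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    if not gaps:
        return len(starts), None, None
    # Assumption: sample standard deviation; a single gap has no spread.
    spread = stdev(gaps) if len(gaps) > 1 else 0.0
    return len(starts), mean(gaps), spread


# Hypothetical profile: (start time in seconds, latency in ms).
profile = [(0.0, 120), (2.0, 90), (3.0, 105), (9.0, 130), (10.0, 50)]
print(interarrival_summary(profile, threshold_ms=100))
```

Running the function once per threshold yields one row of the table per call.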
We then examined the truly long-latency events from the Powerpoint benchmark. Figure 12 shows the event latency profile for all events over 50 ms. Both systems show similar periodicity with the better performing 4.0 system demonstrating smaller interarrival times to match its shorter overall latency.
In the case of Word, the interarrival times are clustered because most events have similar latency. In the case of Powerpoint, the interarrival times of long-latency events are simply the interarrival times of a few particular classes of events. The distribution of these events is entirely dependent upon when we issued such requests in our test script and is not necessarily indicative of the distribution that might be obtained from a real user. In this test, none of the simple keystroke events were responsible for generating long-latency events, rather all the events with latencies over 50 ms result from major operations for which user expectation for response time is generally longer. Until our tools become sophisticated enough to examine long traces of complex events generated by a real user, further analysis of these interarrival times is not particularly productive.
Over time, our tools will become better able to deal with the sophisticated applications that we seek to analyze, but we need the human factors community to assist us in understanding the limits of human perception and the models of user tolerance. Some of the questions that must be answered are:
We have combined a few simple ideas to get precise information about latency in interactive programs. We have shown that using these ideas we can get accurate and meaningful information for simple applications and also, to a degree, for complex applications. The requirements of these techniques are not out of reach; in particular, a hardware cycle counter, a means for changing the system idle loop, and a mechanism for logging calls to system API routines are needed. Additional support for detecting the enqueuing of messages and the state of the I/O queue would provide a more complete framework for latency measurement. We have shown the limitations of our system for applications such as Microsoft Word that use batching and asynchronous computation.
Measuring latency for an arbitrary task and an arbitrary application remains a difficult problem. Our experience with Microsoft Word demonstrates that there are many difficult technical issues to be resolved before latency will become a practical metric for system design. Our graphical representation provides a great deal of information about program behavior to specialists, but is probably not appropriate for more widespread use. The two key components necessary to provide consumers a single figure of merit are further work in human factors and some method for distinguishing user think time from user wait time.
[2] Brian N. Bershad, Richard P. Draves, and Alessan dro Forin, "Using Microbenchmarks to Evaluate System Performance." Proceedings of the Third Workshop on Workstation Operating Systems, IEEE, Key Biscayne, Florida, April 1992, pages 148-153.
[3] Ben Smith, "Ultrafast Ultrasparcs," Byte Magazine, January 1996, page 139. Additional information on the Bytemarks suite is available on the Internet: http://www.byte.com/bmark/bdoc.htm.
[4] J. Bradley Chen, Yasuhiro Endo, Kee Chan, David Mazieres, Antonio Dias, Margo Seltzer, and Michael D. Smith, "The Measured Performance of Personal Computer Operating Systems," ACM Transactions on Computer Systems 14, 1, February 1996, pages 3-40.
[5] Intel Corporation, Pentium Processor Family Developer's Manual. Volume 3: Architecture and Programming Manual, Intel Corporation, 1995.
[6] C. J. Lindblad and D. L. Tennenhouse, "The VuSys tem: A Programming System for Compute-Inten sive Multimedia," To appear in IEEE Journal of Selected Areas in Communication," 1996.
[7] Larry McVoy, "Lmbench: Portable tools for perfor mance analysis," Proceedings of the 1996 USENIX Technical Conference, January 1996, pages 179- 294.
[8] Jeffrey C. Mogul, "SPECmarks are leading us astray," Proceedings of the Third Workshop on Workstation Operating Systems, IEEE, Key Bis cayne, Florida, April 1992, pages 160-161.
[9] James O'Toole, Scott Nettles, and David Gifford, "Concurrent Compacting Garbage Collection," The Proceedings of the Fourteenth ACM Symposium on Operating System Principles, December 1993, pages 161-174.
[10] John K. Ousterhout, "Why Operating Systems Aren't Getting Faster As Fast As Hardware." Pro ceedings of the Summer 1991 USENIX Conference, June 1991, pages 247-256.
[11] Mark Shand, "Measuring Unix Kernel Performance with Reprogammable Hardware," Digital Paris Research Lab, Research Report #19, August 1992.
[12] Ben Shneiderman, Designing the User Interface, Addison-Wesley, 1992.
[13] Jeff Reilly, "SPEC Discusses the History and Rea soning behind SPEC 95," SPEC Newsletter, 7(3):1- 3, September 1995.
[14] M. L. VanName and B. Catchings, "Reaching New Heights in Benchmark Testing," PC Magazine, 13 December 1994, pages 327-332. Further informa tion on Ziff-David benchmarks is available on the Internet: http://www.zdnet.com/zdbop/.