Abstract
The lmbench suite of operating system microbenchmarks provides a set of portable programs for use in cross-platform comparisons. We have augmented the lmbench suite to increase its flexibility and precision, and to improve its methodological and statistical opera tion. This enables the detailed study of interactions between the operating system and the hardware architecture. We describe mod ifications to lmbench, and then use our new benchmark suite, hbench:OS, to examine how the performance of operating system primitives under NetBSD has scaled with the processor evolution of the Intel x86 architecture. Our analysis shows that off-chip memory system design continues to influence operating system performance in a significant way and that key design decisions (such as suboptimal choices of DRAM and cache technology, and memory-bus and cache coherency protocols) can essentially nul lify the performance benefits of the aggressive execution core and sophisticated on-chip memory system of a modern processor such as the Intel Pentium Pro.
There are three standard approaches to quantifying and under standing OS performance: macro-benchmarking (i.e., running actual applications), profiling, and microbenchmarking. Macro-bench -marks, although useful for measuring end-to-end performance on a specific workload, involve too many variables to form a foundation for understanding the hardware basis of OS performance. Kernel profiling can be useful for determining the performance bottlenecks of an OS running a specific workload, but as a general purpose tech nique it has limited applicability because it requires often-unavail able source code for the OS kernel. This leaves microbenchmarks.
Until recently, the only widely-available OS microbenchmark suite was a set of tests developed by Ousterhout to measure the per formance of the Sprite operating system [12]. Ousterhout's bench marks isolate a number of kernel primitives and, when run across multiple platforms, provide some indication of the dependence of OS performance on machine architecture. However, they do not in clude enough detailed tests to characterize the capability of the un derlying hardware and to use that characterization to understand the performance of higher-level kernel primitives. In 1995, McVoy im proved the microbenchmark state-of-the-art with the introduction of his lmbench package: a suite based on a broad array of portable OS microbenchmarks capable of measuring both hardware capabil ities (e.g., memory bandwidth and latency) and OS primitives (e.g., process creation and cached file reread) [11]. Through its detailed tests, lmbench offered the possibility of decomposing operating system primitives into their hardware-dependent components, thus providing the tools necessary to develop an understanding of the re lationship between architectural features and operating system per formance.
On its own, however, lmbench is just a set of tools; for us to use it as the basis for understanding operating system performance, we needed to develop a benchmarking methodology: a means of us ing the tests and their results to build a decomposition of perfor mance from high-level primitives down to hardware details. To see if this was possible, we undertook an in-depth study of the perfor mance of the NetBSD operating system running on the Intel x86 ar chitecture. We chose NetBSD for its openness and its multi- platform support: having the source code meant that we could use kernel profiling to verify our techniques, and its multi-platform support provided the possibility of future cross-architecture com parisons. We selected the Intel x86 architecture as our subject ar chitecture due to its breadth: in its evolution from the i386 through the Pentium Pro, the Intel x86 architecture has progressively in cluded more and more of the advanced features that characterize a modern architecture, including pipelining, superscalar execution, and an out-of-order core with an integrated second-level cache.
As we used lmbench to probe into the details of NetBSD's per formance and its interaction with the x86 architectural features, we found that it had several shortcomings as a tool for the detailed sci entific study of OS-hardware interaction. Most notably, it did not provide the statistical rigor and self-consistency needed for detailed architectural studies. To resolve these shortcomings, we decided to revise lmbench into a suite of tests that would be useful for both cross-platform comparison and detailed system analysis--we wanted to fulfill the lmbench promise of providing a set of tools ca pable of illuminating the inner workings of an operating system in order to bring to light how that operating system's performance de pends on the hardware upon which it runs. Since lmbench provides a sufficiently complete set of tests to cover a broad range of operat ing system functionality, our modification efforts were directed at making the existing tests more rigorous, self-consistent, reproduc ible, and conducive to statistical analysis.
In this paper we first detail the modifications that we made to lmbench, then proceed to demonstrate how we applied our new benchmark suite, "hbench:OS," to the problem of understanding, at the most detailed levels, the dependence of OS performance on ar chitectural features of the Intel x86 platform.
The rest of this paper is organized as follows. In Section 2, we discuss the revisions we made to lmbench. Section 3 describes the methodology that we developed for using hbench:OS to analyze system performance. Section 4 presents the NetBSD case study, and we conclude in Section 5.
For the remainder of this paper, we use lmbench to refer to the original lmbench-1.1 test suite, and hbench:OS to refer to the mod ified test suite. Also, we will refer to the on-chip cache as the L1 (first level) cache and the secondary cache as the L2 (second level) cache.
| Test | Description |
|---|---|
| Memory read/write bandwidths | Determines the bandwidth to memory by timing repeated summing of a large array. |
| bcopy() bandwidth | Determines the memory bandwidth achieved by the bcopy() memory copy routine. |
| File reread bandwidth | Measures the bandwidth attainable in reading cached files from the system buffer cache. |
| TCP bandwidth | Measures the bandwidth attainable through an already-established TCP connection through the loopback interface. |
| Cached mmap-read bandwidth | Measures the bandwidth attainable when reading from a cached file mapped into the process's address space. |
| Process creation | Measures the latencies of three different methods of process creation: via a simple fork(), fork()+exec(), and system(). |
| Signal handler installation | Measures the latency of installing a new signal handler from a user process. |
| TCP connection latency | Measures the latency of setting up a TCP connection across the loopback interface. |
We made two modifications to avoid the timer resolution con straint imposed by gettimeofday(). First, we modified each benchmark to run its tests in an internal loop, timing the entire loop and reporting the average time (total time divided by number of it erations). While many of the lmbench latency tests already used such internal loops, the loops were run an arbitrary, predetermined number of times, causing scalability problems on different-speed systems. To fix this, we modified the internal loops to run for a min imum of one second, calculating the number of iterations dynami cally. The dynamic calculation of the iteration count ensures that the running time of the benchmark will exceed any reasonable timer resolution by a factor of 10 or more, regardless of the system or CPU being used. In addition, the inclusion of internal iteration with the bandwidth tests makes possible precise measurement of the memory and copy bandwidths to the L1 and L2 caches.
For benchmarks where the measurement is destructive and can only be taken once (for example measuring the virtual memory and TLB overhead in reading a memory mapped region), the loop-and- average method is not effective. For these tests, we had to appeal to a hardware-specific solution to gain the timing accuracy needed: we introduced hooks to allow hardware cycle counters (which tick at the CPU's internal clock speed) to replace gettimeofday() for timing. Currently hbench:OS only supports the Pentium and Pentium Pro counters, but adding support for other architectures (such as the Alpha or SuperSPARC) is not difficult. Note, however, that if an architecture supports no hardware counters/timers, it is not possible to measure such destructive events accurately.
Adding the hardware-timer hooks also significantly enhances the flexibility of the hbench:OS package, as the high-resolution timers give hbench:OS the capability of measuring events with low latencies without the need to run the event in a loop, thus allowing collection of cold-cache performance numbers. When using the gettimeofday() timing method, only warm-cache results can be measured, as the loops that are required for accuracy also allow the benchmark to run entirely from the cache.
Our last modification to the timing routines was to include code to measure and remove the overhead introduced by the timing mechanism (either the gettimeofday() system call or the in structions to read the hardware counters). Removal of this timer overhead is essential, especially when using the hardware timers to measure single low-latency events. When combined with the use of the hardware counters, this allows for precise timing measure ments: on a 120 MHz Pentium, for example, our timings are accu rate to within one clock cycle, or 8.3 ns.
The most-used statistical policy in lmbench is to take the min imum of a few repetitions of the measurement; this is intended to pick out the best possible result by ignoring results contaminated by system overhead. However, in doing so it can pick out results that are flukes--especially when the measurement involves subtracting an overhead value, as in the context-switch latency benchmark. If the actual overhead on a specific run is lower than the pre-calculat ed overhead, an abnormally good result will be obtained when the pre-calculated overhead is removed from the result. To avoid these problems, in most cases we take an n%-trimmed mean of the re sults: we sort the results from a benchmark, discard the best and worst n% of the values, and average the remaining (100-2n)%. n is typically 10%. With this policy, we discard both the worst values resulting from extraneous system overhead as well as the overly- optimistic results.
For certain benchmark tests, however, a simple trimmed mean is not sufficient to capture all of the important features of the re sults. This is particularly noticeable when the results of a test do not approximate a normal distribution, but are (for example) bimodal. Such cases are easily detected by their large standard deviations, and since all data is preserved, it is easy to view the actual distribu tion of the data to determine the best interpretation. We encoun tered this problem in measuring L2 cache bandwidth, as cache conflicts within our test buffer produced a bimodal distribution where the true bandwidth was represented by a large, narrow peak and the false (conflict) bandwidth was represented by a lower peak with larger spread. In this case, we merely increased the percentage that was trimmed from the data in order to isolate only the true bandwidth peak.
Finally, we have modified the benchmarks (where possible) to perform one iteration of the test before beginning the real measure ment. Since we run most of the tests in loops anyway, we expect warm-cache results. Running the test once before commencing measurement ensures that the caches are primed and that any need ed data (e.g., files in the buffer cache) are available.
Note that in gathering the results in this paper, we ran each benchmark (each of which runs a large number of internal itera tions) fifty times on all machines but the 386-33 and 486-33 (due to limited access to the hardware, only five iterations were performed on these machines), and we report the 10% trimmed average across these iterations. Standard deviations are represented by error bars in the graphs; in all cases standard deviations were less than 1% (and are frequently not visible in the graphs) except in the file reread benchmark, which produced standard deviations of less than 5%.
Although measuring cache conflict overhead is useful (espe cially for estimating context-switch time for large processes), lm bench's context switch benchmark demonstrates that there are problems with this approach that make it infeasible for a portable context switch benchmark. The most significant problem occurs when the operating system does not support intelligent page color ing, i.e., it chooses physical addresses for virtual pages randomly. To understand why this is a problem for lmbench, we need to inves tigate how lmbench collects its context switch latency data.
The lmbench context switch latency benchmark measures the time to pass a token around a ring of processes via pipes; to dupli cate the effect of a large working set, each process sums a large, pri vate data array before forwarding the token, thereby forcing the pages of the array into the cache. When the total time for this oper ation is divided by the number of processes in the ring, lmbench is left with a number that includes the raw context-switch time, the time to fault the array into the cache, the time to sum an already- cached array, and the time to pass a token through a pipe. The latter two factors are measurement overhead and must be removed. To do so, lmbench passes a token through a ring of 20 pipes within one process, summing the same data array each time the token changes pipes, then divides by the number of times the token went through a pipe. The problem with this approach is that the test assumes that summing the buffer produces no unnecessary cache conflicts, for the summing overhead should not include any cache-fill time. However, if the virtually-contiguous pages of the buffer are ran domly assigned to physical addresses, as they are in many systems, including NetBSD, then there is a good probability that pages of the buffer will conflict in the cache, even when the size of the buffer is smaller than the size of the caches [3]. Thus the overhead will con tain some cache-fill time, and as a result might be too high; if the actual context switch test obtains good page mappings, the over head may even be so high that when lmbench subtracts it from the total time to get just the context-switch latency, the resulting (re ported) context switch latency is negative or zero. A similar prob lem exists if the overhead-measurement test obtains good page mappings while the real context switch latency test obtains conflict ing mappings; here the overhead will be too small, and the reported context switch latency will be too large.
Because with lmbench there is no guarantee of reproducible context-switch latency results in the absence of OS support for in telligent page coloring, we decided, in hbench:OS, to restrict the test to measure only the true context switch time, without including the cost to satisfy extra cache conflicts or to fault in the processes' data regions, as these can be approximated from the cache and memory read bandwidths. To this end, we introduced a new context switch latency benchmark to supplement the existing lmbench test. We did not replace the lmbench test completely, as it can be useful in estimating user-visible context-switch latencies for applications with a known memory footprint, and for determining cache associa tivity. In our new test, context switch latency is a function of the speed of the OS in selecting a new process and changing the hard ware state to run it. To accomplish this, we carve each process's data array out of a large region shared between all the processes in the ring. To compute the overhead for nproc processes, we measure the time to pass a token through nproc pipes in one process, sum ming the appropriate piece of the shared region as the token passes through each pipe. Thus we duplicate exactly what the real context switch test does: we use the same memory buffers with the same cache mappings, and touch them in the same order. When we sub tract this overhead from the context switch measurement, we are left with the true context switch time plus any hardware-imposed overhead (such as refilling non-tagged TLBs and any cached data that got flushed as a result of the context switch but not as a result of faulting in the process). With these modifications, we can obtain results with a standard deviation of about 3% over 10 runs, even with large processes, and without having to flush the caches. In contrast, on the same machine, lmbench reports results with stan dard deviations greater than 10%.
movl $1, (%ebx)
movl $1, 4(%ebx),while a similar example using pointers (*ebx++ = 1; *ebx++ = 1;) is implemented as:
movl $1, (%ebx)
addl $4, %ebx
movl $1, (%ebx)
addl $4, %ebx.Depending on how the processor's pipeline handles interlocks, the two methods can produce different timings. For example, on the Alpha processor, memory read bandwidth via array indexing is 26% faster than via pointer indirection; the Pentium Pro is 67% faster when reading with array indexing, and an unpipelined i386 is about 10% faster when writing with array indexing. To avoid errors in interpretation caused by these discrepancies, we con verted all data references to use array-offset addressing. In addi tion, we modified the memory copy bandwidth to use the same size data types as the memory read and write benchmark (which use the machine's native word size); originally, on 32-bit machines, the copy benchmark used 64-bit types whereas the memory read/write bandwidth tests used 32-bit types.
However, hbench:OS used in a vacuum will not provide the information needed to understand the interactions and performance characteristics of a given system. Thus we next propose a method ology for analyzing and interpreting its results. The benchmarks in hbench:OS can be roughly divided into two layers: one that quanti fies hardware capabilities (such as memory bandwidth), and anoth er that measures the primitive functionality that is exported from the kernel to applications (such as system calls, process creation, and file/network access). When these layers are combined with the hardware at the bottom and OS performance at the top, a hierarchi cal structure emerges.
Figure 1 depicts this hierarchical structure as a pyramid of relationships between layers representing components of OS performance. Our goal in creating a methodology for hbench:OS was to provide a means of reconstructing the interde pendencies in the pyramid. Thus, when the methodology is applied, the result is a decomposition of the performance of the highest-level kernel primitives (those seen by kernel-dependent applications) into the performance of the underlying hardware; such a decompo sition can then be used to predict which architectural features influ ence OS performance.
Our methodology consists of two steps: first, using hbench:OS to measure performance at each level of the pyramid while varying features of the hardware; and second, using the changes in hard ware as well as software analysis (via hardware counters, software profiling, or code analysis) to relate the performance of primitives in a given layer to the performance of layers above and below it. Once the interaction between each pair of adjacent layers is under stood, the pyramid of performance dependencies can be recon structed.
In many cases, the hbench:OS tests provide enough detail about the individual layers of the pyramid so that such a reconstruc tion is possible: both bulk data transfer primitives and process cre ation primitives can be decomposed using the pyramidal model. In the other cases, where hbench:OS does not provide detailed bench marks to quantify the hardware capabilities used by a higher-level primitive, it is necessary to bypass the middle layer of the pyramid and determine directly the hardware dependence of each test. These results can be gleaned from the information obtained by analyzing how the results of a particular benchmark change when the hard ware is varied in a controlled manner (i.e., when only one compo nent of the system is changed at a time). We found this technique especially useful for relating the lowest-level hbench:OS primitives (such as memory read bandwidth) to features of the hardware archi tecture.
Comparisons between benchmark results from various subsets of our test machines reveal dependencies on features of the CPU ar chitecture and the memory system. For example, comparing the 100 MHz Endeavor Pentium with the 100 MHz Premiere-II Pen tium reveals the effect of pipelining the L2 cache and installing EDO memory; similarly, comparisons between the 90, 100, and 120 MHz Endeavor Pentiums reveal the effects of increasing the CPU clock rate while holding the memory system constant.
| Name-MHz | Caches Features | Memory/Bus-MHZ | Processor |
|---|---|---|---|
| 386-33 | no L1, 64K async. L2 | 70 ns, 33 MHz | i386DX |
| 486-33 | 8K combined L1, 256K async. L2 | 60 ns, 33 MHz | i486DX |
| 486-66 | 60 ns, 33 MHz | i486DX2 | |
| Endeav-90 | 16K split L1, 512K pipeline-burst L2 | 60 ns EDO, 60 MHz | Pentium (i430FX chipset) |
| Endeav-100 | 60 ns EDO, 66 MHz | Pentium (i430FX chipset) | |
| Endeav-120 | 60 ns EDO, 60 MHz | Pentium (i430FX chipset) | |
| Prem-100 | 16K split L1, 512K async. L2 | 70 ns, 66 MHz | Pentium (i430NX chipset) |
| Pro-200 | 16K L1, 256K L2, both write back and on-chip | 60 ns EDO, 66 Mhz | Pentium Pro (i440FX chipset) |
Applications that rely on bulk data transfer use one of three methods to access their data: reading from a file in the file system, sending and receiving data on a TCP connection, or mapping a file into their address space. Since each of these data-access methods involves a significant number of memory accesses, we can base our decomposition on the hbench:OS tests that measure the hardware memory read, write, and copy bandwidths. If we ignore the effects of disk and network latency (since we run all of our disk tests within the buffer cache and all of our network tests on the loopback inter face), we arrive at the decomposition shown in Figure 2
. There is also a CPU computation component in each of the application-level primitives; it is most significant in the TCP test due to the complex ity of protocol encapsulation and checksumming.
The hbench:OS tests that can be used to quantify the hard ware's capability for bulk data transfer, i.e., those which measure the bottom layer of Figure 2, are the raw memory bandwidth tests, which measure effective software read and write bandwidths--the attainable bandwidths when array-addressing operations (needed to index through memory) are inserted between each memory refer ence. Although the raw hardware transfer bandwidths are potential ly higher, the software bandwidths are more representative of what is attainable by actual code.
plots the peak raw bandwidths for reading from and writing to both caches and main memory of several of the test ma chines. The almost 4-fold improvement in L2 and main memory read performance between the 486-66 and the Prem-100 is due to increased bus bandwidth and bus burst capability. The Pentium sys tem has a 64-bit data bus, twice as wide as the 486's 32-bit bus; in addition, the Pentium supports burst transfers from the system's fast page mode DRAM, while the 486 does not. The measured write performance only doubles from the 486-66 to the Prem-100, be cause the older chipset on the Premiere system does not burst writes to DRAM; only the wider path to memory plays a role in the speed up compared to the 486.
The write performance of the Endeav-100 doubles that of the Prem-100 because of the Endeavor motherboard's pipeline-burst L2 cache and EDO DRAM. The pipeline-burst cache can latch three out of every four memory references in one bus cycle each and then burst them off to the DRAM. This explains why the main memory write bandwidth is comparable to the L2 cache's inherent read bandwidth--the pipelined cache is hiding much of the already- low DRAM latency from the CPU. Note that on the Endeav-120, which shares the same memory subsystem as the Endeav-100 and Endeav-90, the DRAM and L2 read bandwidths are higher than ex pected from comparison with the Endeav-90, since the processor is clocked at an integral multiple of the memory bus speed. This al lows the Endeav-120 to utilize more of the bus bandwidth (61% vs. 54% for the Endeav-100, as determined with the Pentium hardware event counters) since CPU and bus cycles coincide more frequent ly.
It is interesting to note that in the raw memory bandwidth tests, the dual-issue capability of the Pentium is being very poorly uti lized. We instrumented hbench:OS with the Pentium's built-in hardware counters, and discovered that when memory is accessed by summing an array using array-offset instructions, less than 0.1% of the memory instructions are parallelized. Similar results are found when the built-in string opcodes are used. Parallelism can be increased to nearly 50% by using pointer arithmetic to step through the array. In this case, each pointer increment is issued along with a memory reference, and is essentially free; however, two memory references are never issued simultaneously. In addition, this extra parallelism is introduced at the cost of an extra stall cycle on each memory access due to address generation interlocks. Thus both methods of memory access provide approximately the same perfor mance, so we predict that memory intensive workloads may profit less than expected from the superscalar architecture of the Pentium.
This conclusion also raises the interesting issue of the useful ness of micro-optimizing compilers for the OS kernel. We experi mented with the PCG version of pgcc (an adaptation of the GNU gcc compiler that performs aggressive instruction scheduling for the Pentium pipelines) and discovered that pgcc's optimizations had essentially no effect on the performance of the memory-inten sive benchmarks, even when the memory accesses were explicitly coded (as opposed to using the built-in string operations). The prob lem is that the hardware itself does not allow dual-issue of memory references in the cases we tested, and thus no instruction scheduling policy could improve performance in these cases.
Returning to the data in Figure 3, we see that the most spectac ular feature is the performance of the Pentium Pro system. The Pro- 200 exhibits a strange combination of impressive across-the-board memory bandwidth, except for uncharacteristically poor main memory write bandwidth. The Pro-200's nonblocking write-back L1 cache gives it an extreme performance advantage over the Pen tiums on small cached reads and writes. The Pro-200 L2 cache also significantly outperforms that of the Pentiums, as the Pentium Pro runs its on-chip, lockup-free L2 cache at the CPU clock speed, as opposed to the system bus speed. Also, while the Pentiums' non- write-back caches access memory on every write, the Pentium Pro's write-back caches are intelligent enough to combine writes into cache-line-sized increments, resulting in cached write perfor mance that nearly equals cached read performance, as the write- back cache is not forced to read a line before writing to it. The as tounding cache performance on the Pro-200 suggests that write- back caches offer a major performance advantage to those applica tions that perform bulk data transfer in small, cache-sized chunks, for example, the size of a typical HTML file.
Along with its high cache performance, the Pro-200 also sports exceptionally high main memory read bandwidth. In fact, the 216 MB/s that it achieves approaches the 226 MB/s theoretical maximum bandwidth out of 3/2/2/2-clocked EDO DRAM on a 66 MHz bus. The reason for this exceptional performance is twofold. First, the Pentium Pro sports an out-of-order execution engine that is capable of reordering memory reads and removing the data de pendencies implicit in the benchmark. By using register renaming and speculative memory reads, the Pentium Pro can implicitly batch and prefetch data reads, thus allowing it to issue memory reads as fast as the external memory system can handle them. Sec ond, and more importantly, the Pentium Pro's pipelined, transac tion-based system bus allows it to issue consecutive back-to-back data read transactions without incurring bus turn-around time and transaction set-up costs [8]. In contrast, the Pentium executes all memory operations in sequence, inserts extra data dependency stalls due to its small register set, and negotiates for the system bus on each read request.
The Pro-200's main memory write bandwidth, in contrast, is exceptionally low--almost 18% slower than the write bandwidth of the Endeav-100, a system with identical DRAM and the same bus speed. To determine why this was the case, we instrumented the benchmark with the Pentium Pro's built-in hardware counters [9]. For each 32-byte line of data written by the CPU, the counters indi cate that two bus transactions take place: a writeback transaction and a read-for-ownership (RFO) transaction. The writeback is ex pected, since as the CPU stores a line into the cache it must displace an existing dirty line from a previous write. The RFO on the line about to be written is used to guarantee cache-coherency: the CPU must ensure that no other CPU in the system has a dirty copy of the line it is about to write. However, there is no need for a read-for- ownership transaction in our case, as the Pro-200 is a single-pro cessor system, and thus there are no other CPUs that could contain a dirty line; there is similarly no need to read the entire line, as we have seen in the L2 cache bandwidth that the write-back cache is in telligent enough not to load a line that is about to be entirely rewrit ten. Thus by interspersing a RFO transaction between each write, the available bus bandwidth drops significantly, as the CPU must renegotiate for the bus on each write, instead of performing back- to-back writes (as it does in the read case). Also, there is the bus overhead of the read-for-ownership transaction itself, and the bus turn-around time needed to switch between the transactions. Thus it seems that requiring the demonstrably high overhead of a RFO- based cache coherency protocol even when there is only one CPU in the system is a suboptimal design, as it severely cripples the available memory write bandwidth on the Pentium Pro.
It appears that Intel may have attempted to compensate for this design by including an undocumented "FastStrings" flag in one of the Pentium Pro's control registers: when FastStrings are enabled, the RFO transactions are converted to Invalidate transactions (so the cache does not read the new line but merely invalidates it in oth er CPUs). However, on a single-CPU system the Invalidate trans action is still unnecessary since there is only one cache on the bus. Additionally, this feature only improves DRAM write bandwidth slightly (about 5%) and only when certain string instructions are used to perform the write; converting the RFOs to Invalidates does not remove the bus transaction and renegotiation overhead, the ma jor factor in the low DRAM write bandwidth.
The primary kernel primitive relied upon by bulk-data appli cation primitives is bcopy, used to transfer data around the kernel and between kernel and user space. Our bcopy benchmark uses the libc bcopy routine (identical to kernel bcopy in NetBSD) to copy both cached and uncached buffers in user space; this routine uses the x86 string instructions to efficiently move data. In the ideal case, we expect the results of the bcopy benchmark be one-half of the harmonic mean of the read and write bandwidths for each ma chine, since each byte copied requires one read and one write. How ever, when reads and writes are combined into copies, unexpected interactions can develop and cause the measured copy bandwidth to exceed or fall short of the half-harmonic mean prediction. These are the more interesting cases, as they illustrate optimizations or flaws in the hardware design, and how such design characteristics affect performance. In Figure 4
, we present the results of the non-cached bcopy test along with the half-harmonic means calculated from the raw bandwidth results in Figure 3; the cached bcopy results are sim ilar. For all systems but the Pro-200, the graph shows the expected result that bcopy bandwidth is directly correlated with the raw memory bandwidths. The measured results slightly exceed the pre dictions in most cases because the CPU (executing the x86 string operations) can issue the reads and writes back-to-back, without de coding and executing explicit load and store instructions.
Those machines with poor raw write bandwidth suffer in the bcopy test, since both read and write bandwidths have an equal in fluence on the copy bandwidth: for example, although it uses the same processor, the Prem-100 achieves only half the bcopy band width of the Endeav-100. This again demonstrates the effectiveness of an enhanced memory subsystem with a pipelined L2 cache and EDO DRAM at improving performance of operations requiring the movement of large quantities of data. The disappointing write per formance of the Pentium Pro memory system on large buffers com pletely negates the advantages of its advanced cache system, resulting in bcopy performance that is actually worse than that of some of the Pentiums. The Pro-200's copy bandwidth also falls far short of our prediction, since, when copying, the processor cannot issue back-to-back reads on the system bus, and must alternate read, write, and read-for-ownership transactions; each new transaction requires setup and negotiation overhead. Again, enabling Fast Strings on the Pentium Pro has little effect (less than 1%) because the extra coherency transaction is still present.
With this understanding of bcopy, we move on to consider the application-level data-manipulation primitives: cached file read, lo cal TCP data transfer, and mmap()'d file read. We will only consid er the first two methods here; full details on mmap()'d file read can be found in a more detailed report on this system comparison [2]. From the decomposition presented earlier in Figure 2, we expect that a significant component of the attainable bandwidths for each of these primitives is due to a dependence on the memory system, and thus we expect that the architectural changes that have en hanced memory system performance (such as faster, wider busses and pipelined caches) will enhance the performance of these prim itives as well. We now examine each of these three bandwidth mea surements in order to determine if this is the case, or if other factors besides the memory system are involved.
The hbench:OS file reread benchmark measures the band width attainable by an application reading data from a file present in the kernel's buffer cache; we used 64KB read requests for this test. For each byte transferred in this test, NetBSD performs one memory read from the kernel's buffer cache, one memory write to the user buffer, and a final memory read as the benchmark touches the byte. This is one more memory read than the bcopy test, so one might expect file reread to be significantly slower than bcopy. Sim ilarly, the TCP bandwidth test involves transferring 1MB in-mem ory buffers over the local loopback interface. In this test, each byte transferred must be copied three times, so we expect at least a 3-fold performance degradation relative to bcopy.
The results for these two benchmarks on several of our test machines are shown in Figure 5
along with predicted results de rived from the bcopy test and raw bandwidth tests. The TCP band widths show the expected pattern: the relative performance is comparable to that of bcopy, while the magnitude is approximately one-third that of bcopy. As expected, there is a partial CPU depen dency, since TCP's checksumming and encapsulation require more processing than bcopy; the Pentium Pro's out-of-order execution allows it to overlap some of the computation and memory referenc es involved in the TCP processing, giving it a slight performance edge. However, it is clear that the memory system still dominates TCP transfer bandwidth.
The file reread results also show similar relative performance to the bcopy results, with the exception of the Pentium Pro. Al though the predicted bandwidth again far exceeds the actual due to bus turnaround time that was not included in the raw read band width, this machine still far outclasses the Endeavor-based Pen tiums on this test despite its slower main memory system and poor bcopy performance. The reason for this is that the 64KB transfer buffers all lie entirely in the fast write-back L2 cache for the dura tion of the benchmark. If the read request size is increased to 1MB, larger than the 256KB L2 cache, the performance drops by a factor of two, as the buffers fall out of the L2 cache. The same effect can be seen in the TCP bandwidth test: using 64KB socket buffers in stead of the default 1MB buffers increases performance 200% from the value in Figure 5. These results suggest that a fast write-back L2 cache can provide a significant advantage to an application that pro cesses large amounts of data using a single buffer that fits within the L2 cache; if the buffer is large or if the application does not reuse the same buffer repeatedly, the overhead of faulting-in cache lines over a slow bus significantly reduces the write-back advantage.
Thus, in the case of the bulk data transfer primitives that an OS-dependent application might use, our decomposition is com plete: the user-visible primitives of cached file reread and TCP data transfer are nearly entirely dependent on the memory system, and therefore it is features of the memory system that will most affect the performance of these primitives. The Endeavor-based Pentium results imply that for high-bandwidth applications, a main memory system based on fast DRAM technology (such as EDO memory) is essential. The Pro-200's performance suggests again that eliminat ing unnecessary cache-coherency and bus transaction overhead will increase its performance greatly. It also suggests that intelligent, non-blocking, write-back caches are a net performance win both when reading large amounts of data and when handling data in units small enough to be cached, despite the delays that can be incurred in fetching lines upon write. In fact, analysis of these benchmarks with the Pentium Pro hardware counters shows that, while transfer ring large amounts of data, the Pentium Pro rarely needs to read en tire lines before writing into them, as the cache is intelligent enough to accumulate line-sized writes. Thus we conclude that large im provements to the CPU's execution unit (as in the Pro-200) may have a much less visible effect on high-bandwidth applications than small improvements in the memory subsystem (i.e., the use of a non-blocking write-back or pipeline-burst cache). Since multime dia applications and even the X Windows server transfer large quantities of data via the application primitives we have considered here, making these simple memory-system optimizations is crucial to attaining high performance.
In order to isolate each component of process creation, we de compose the more complex operations (e.g., process creation via /bin/sh) into the fundamental operations that we can measure. By subtracting fork latency from the combined fork and exec, we de rive exec latency. The /bin/sh case is somewhat more complicated in that it consists of:
We begin our analysis of these results with the one process creation metric that hbench:OS directly exposes: the cost of a fork, represented by the lowest sections of the bars in Figure 6
. Compar ing the fork cost across the suite of test machines reveals that the fork cost is primarily dependent on the memory system, although it does have a small clock-speed or CPU dependent component. Both the 486-33 and the 486-66 (which share the same memory system) demonstrate approximately the same fork latency; the 486-66 is slightly faster, highlighting the small CPU component. Similarly, the Prem-100, with its slower memory system, exhibits larger fork latency than its Endeavor-based counterpart. The Pentium Pro out performs the Pentium due to both the CPU component of the test and the small-write-biased nature of the test: a fork on NetBSD/x86 involves building and zeroing a page table structure that fits in the Pentium Pro's write-back L2 cache.
Next we consider the exec latency, which we decomposed from the high-level hbench:OS tests by subtracting the fork latency from the fork+exec latency. The same comparisons as above reveal primarily a memory-system dependence for the static case: the OS must demand-copy the executable from the in-memory file system buffer cache. In this case, the CPU dependent component is mini mal, and most likely results from the actual execution of the hello- world program. The exec latency in the dynamically-linked case has quite a different pattern. First, the latency is exceptionally large due to the cost of loading and mapping the shared libraries. We still observe a significant memory-system dependency, but the CPU de pendent component has grown due to the need to build and initial ize jump tables for the libraries. This is again evident in comparisons between the 486-33 and 486-66, and between the Prem-100 and the Endeav-100: in the first case, the 486-66 outper forms the 486-33, but not as much as pure CPU scaling would sug gest; in the second, the 100 MHz Pentium on the Endeavor motherboard outperforms the same chip on the Premiere mother board. Again, since the benchmark fits in its L2 cache, the Pro-200 performs well on this test, but still not significantly better than the Endeavor Pentiums.
Finally, having used the high-level hbench:OS tests to extract the fork and exec latencies, we use these results to complete our de composition by analyzing the overhead imposed by using the shell to invoke the hello-world process via the system() routine. If we consider the decomposition in Figure 6, we see that the /bin/sh overhead includes only the time involved in exec'ing /bin/sh and forking /bin/sh; the original fork to start the shell and the exec of hello-world are already accounted for. Comparing the /bin/sh over head across the various test machines, we again see a heavy mem ory system dependency, just as we saw for the statically-linked fork and exec latencies. This is because the fork and exec components of the /bin/sh overhead are directly related to these fork and static- exec latencies, since under NetBSD, /bin/sh is statically linked. However, the magnitude of the /bin/sh overhead is significantly greater than the magnitude of the static hello-world fork and exec; this is because the shell binary is almost seven times larger than the statically-linked hello-world binary, so the memory component in volved in paging in the executable and initializing its mappings is proportionally larger. When "hello-world" is dynamically linked, the shell overhead is only slightly larger due to the extra overhead of managing the shared library mappings when starting the shell.
Thus, in process creation, we have an example of an alternate method of performance decomposition via hbench:OS: in this case, we began with the measured performance of high-level operations (process creations) and massaged these data to extract the perfor mance of the primitive operations upon which the high-level oper ations are based (such as fork and exec latencies). We then applied our cross-platform comparison technique to understand the hard ware basis for the performance of the low-level primitives. The in escapable conclusion is that, yet again, the memory system dominates performance: all of the primitive latencies, and the high- level process creation latencies, depend primarily on the memory system, and include only a small CPU-dependent component. Thus the Pentium Pro's performance margin over the Pentium systems is due not to its advanced out-of-order core, but rather to its speedy on-chip cache system.
plots the results from this benchmark on some of our test machines.
The results indicate that, with the exception of the Pro-200, signal handler installation latency is entirely dependent upon CPU clock rate within each CPU class. The Endeav-100 and Prem-100 both perform almost the same number of installations per second despite their disparate memory systems; the 486-66 doubles the 486-33's performance, and the Endeav-120 performs about 120/90 more installations per second than the Endeav-90. Comparisons be tween CPU classes suggest that there is a subtle performance de pendence on more than the raw instruction execution rate: the 386 achieves less than half the performance of the 486 at the same clock speed, and the 486s obtain only about half the performance that a Pentium would at the same clock rate. This suggests that the perfor mance of signal handler installation, in fact, depends on the L1 cache speed: the 386 has no L1 cache, so its performance is halved compared to the 486; the 486 requires two stall cycles to access data in its L1 cache compared to the Pentium's one cycle, accounting for the factor of two performance gain in the Pentium class. This hy pothesis is confirmed by source code analysis, profiling, and anal ysis with the Pentium hardware counters: the signal handler installation system call spends the bulk of its time copying small, easily-cacheable data structures to and from user space. The only mystery that remains, then, is the Pro-200 result, which is 27% worse than would be expected based on cycle time alone, and 42% worse than would be predicted based on the Pro-200's L1 cache bandwidth. Without the underlying low-level tests that a full de composition might offer, we have no way to understand this anom alous result, and can only speculate that for some reason, perhaps when the OS switches from user to kernel mode, some internal CPU state (such as the branch target buffers) is being flushed, or else the CPU is incurring more unnecessary cache-coherency overhead. Even the Pentium Pro hardware performance-monitoring counters do not shed light on this bizarre result.
Thus, in the case of signal handler installation, we see that we can obtain generally useful results even when hbench:OS does not include the capability to decompose the performance of the inter esting high-level functionality into lower-level primitives. Here, we can conclude that lower-latency L1 caches are the key to improving signal handler installation performance. However, we are left at a loss when anomalous results (such as the Pro-200's) appear, since we have no lower-level tests to use as a basis for understanding the unexpected results.
When we applied our hbench:OS-based analysis techniques to NetBSD, we were surprised to find that, for nearly all high-level OS primitives upon which modern applications depend, it is the mem ory system, and especially the access path to the off-chip memory system, that dominates performance. Particularly intriguing were the results from the Pro-200, the Pentium Pro-based machine. De spite major improvements to the processor's execution pipeline and cache subsystem compared to the Pentium, the Pentium Pro did not significantly outperform the Endeavor-based Pentiums on many of the tests. In fact, the addition of multiprocessor coherency support and transaction-based bus protocols into the CPU, and the resulting poor external memory system bandwidth, seem to have essentially negated any performance advantage that the CPU's advanced exe cution core provides. Essentially, Intel's multiprocessor optimiza tions have crippled the performance in single-CPU systems.
Also illuminating was the comparison between the Endeavor- and Premiere-based Pentiums; the Endeavor, with pipeline-burst cache and EDO support, outperformed the Premiere system by nearly a factor of two in many cases, MHz for MHz. While re searchers have known for several years that a high-performance memory subsystem is important to OS performance [1][12][13], it seems that, at least for the x86 architecture, the industry's focus on the processor's pipeline and cache subsystem has been misdirected. For example, Intel's high-end Pentium server motherboard, the Xx press, eschews the advantages of EDO or synchronous DRAM for a large 1MB L2 cache since the larger cache produces higher SPECmark ratings [7]; our results indicate that to improve the per formance of OS-dependent server applications, Intel would have done better by engineering a higher-performance EDO DRAM- based system than by focusing on the caches.
High-bandwidth OS-dependent applications have become common--witness the explosion in the number of web servers run ning today. It is time for vendors to turn their focus from SPEC marks and application-level performance to OS performance. Tools such as the hbench:OS suite offer a solid foundation for evaluating OS performance, and can bring to light surprising facts about the design of today's systems, as we have seen in our analysis of Net BSD on the x86 platform. It is tools such as these that will provide the understanding of the architectural dependence of OS perfor mance necessary to build tomorrow's high-performance hardware.
[1] Anderson, T., Dahlin, M., Neefe, J., Patterson, D., Roselli, D., Wang, R., "Serverless Network File Systems," Proceedings of the Fifteenth Symposium on Operating System Principles, Copper Mountain, CO, December 1995, 109-126.
[2] Brown, A., "A Decompositional Approach to Performance Evaluation," Technical Report TR-03-97, Center for Research in Computing Technology, Harvard University, 1997.
[3] Chen, B., Bershad, B., "The Impact of Operating System Structure on Memory System Performance," Proceedings of the Fourteenth ACM Symposium on Operating System Principles, Asheville, NC, December 1993, 120-133.
[4] Chen, B., Endo, Y., Chan, K., Mazieres, D., Dias, A., Seltzer, M., Smith. M., "The Measured Performance of Personal Computer Operating Systems," Proceedings of the Fifteenth Symposium on Operating System Principles, Copper Mountain, CO, December 1995, 299-313.
[5] Computer Systems Research Group, University of Califor nia, Berkeley, "4.4BSD-Lite CD-ROM Companion," O'Reilly and Associates, 1st Edition, June 1994.
[6] Gloy, N., Young, C., Chen, J., Smith, M., "An Analysis of Dynamic Branch Prediction Schemes on System Workloads," Pro ceedings of the International Symposium on Computer Architec ture, May 1996.
[7] Gwennap, L., "Pentium PC Performance Varies Widely," Microprocessor Report, 2 October 1995, 14-15.
[8] Intel Corporation, Pentium Pro Family Developer's Manual, Volume 1: Specifications, Order number 242690-001, Mt. Pros pect, IL, 1996.
[9] Intel Corporation, Pentium Pro Family Developer's Manual, Volume 3: Operating System Writer's Manual, Order number 242692-001, Mt. Prospect, IL, 1996, Appendix B.
[10] Maynard, A., Donnelly, C., Olszewski, B., "Contrasting Characteristics and Cache Performance of Technical and Multi- User Commercial Workloads," Proceedings of the Sixth Interna tional Conference on Architecture Support for Programming Lan guages and Operating Systems, San Jose, CA, October 1994, 145-157.
[11] McVoy, L., Staelin, C., "lmbench: Portable Tools for Perfor mance Analysis," Proceedings of the 1996 USENIX Technical Conference, San Diego, CA, January 1996, 279-295.
[12] Ousterhout, "Why Aren't Operating Systems Getting Faster As Fast as Hardware?" Proceedings of the 1990 Summer USENIX Technical Conference, Anaheim, CA, June 1990, 247-256.
[13] Rosenblum, M., Bugnion, E., Herrod, S., Witchel, E., Gupta, A., "The Impact of Architectural Trends on Operating System Per formance," Proceedings of the Fifteenth Symposium on Operating System Principles, Copper Mountain, CO, December 1995, 285-298.