SandStorm I/O Core Benchmarks

Matt Welsh, Harvard University
Last updated 7 Jan 2001

This page documents a set of benchmarks demonstrating the scalability of a Java-based server built on nonblocking I/O. Here we compare thread-based concurrency against nonblocking I/O using the Java NBIO library.

Note: These results are somewhat outdated. More recent analysis is provided in our SOSP'01 paper.

Basic server setup: The server accepts socket connections from clients; for each burst of 1000 8192-byte packets received, it sends back a short 32-byte acknowledgment. The server and clients simply measure the bandwidth of the connection.
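The per-connection burst/ack logic described above can be sketched as follows. This is my own minimal sketch, not the original benchmark code; the class and method names (`BurstEcho`, `handle`) are hypothetical, and it is written against plain blocking streams for clarity.

```java
import java.io.*;

// Hypothetical sketch of the per-connection logic: read fixed-size
// packets and write a 32-byte ack after every burst of 1000 of them.
public class BurstEcho {
    static final int PACKET_SIZE = 8192;
    static final int BURST = 1000;
    static final int ACK_SIZE = 32;

    // Returns the number of acks sent for the given input stream.
    static int handle(InputStream in, OutputStream out) throws IOException {
        byte[] packet = new byte[PACKET_SIZE];
        byte[] ack = new byte[ACK_SIZE];
        int packets = 0, acks = 0;
        DataInputStream din = new DataInputStream(in);
        while (true) {
            try {
                din.readFully(packet);      // one full 8192-byte packet
            } catch (EOFException e) {
                break;                      // client closed the connection
            }
            if (++packets % BURST == 0) {   // ack each 1000-packet burst
                out.write(ack);
                acks++;
            }
        }
        return acks;
    }
}
```

The same loop body is what each of the server variants has to drive, whether from a dedicated thread or from an event handler.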

There are seven implementations of the server.

The server and all clients are 4-way 500 MHz Pentium III machines running Linux 2.2.15 with the IBM JDK 1.1.8. All nodes are connected with Gigabit Ethernet.

This graph shows the aggregate bandwidth measured by the server as the number of connections grows from 0 to 1000. Some things to note about this:

First, the NBIO-based servers all sustain good throughput even out to a large number of connections. When using poll(2), there is some penalty as the number of connections increases, which is no doubt due to the overhead of the Linux poll(2) implementation -- see below. When using /dev/poll instead, the graph is essentially flat out to 1000 connections!

The threaded servers cannot run beyond 400 and 450 connections, respectively. This is because each server requires 1 thread per socket connection, but Linux has a per-user limit of 512 processes (and IBM "native threads" are actually processes under Linux). In fact, to run above 256 threads, one needs to run as root and set ulimit -u unlimited. Even so, we can see that the threaded server performance starts to degrade as the number of connections increases.
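The thread-per-connection structure that runs into this limit looks roughly like the following. This is a sketch with hypothetical names (`ThreadedServer`, `acceptLoop`), not the original code: the point is simply that every accepted socket consumes one thread, and under IBM JDK 1.1.8 native threads on Linux, one process.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the thread-per-connection design: N connections cost N
// threads, which is what collides with the 512-process per-user limit.
public class ThreadedServer {
    static void acceptLoop(ServerSocket listen, int maxConns) throws IOException {
        for (int i = 0; i < maxConns; i++) {
            final Socket s = listen.accept();   // one dedicated thread per socket
            new Thread(() -> {
                try {
                    // the per-connection read/ack loop would run here
                    s.close();
                } catch (IOException e) { /* connection torn down */ }
            }).start();
        }
    }
}
```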

The aSocket servers perform worse than the "raw" NBIO and threaded servers. This is not surprising, since aSocket provides a nice level of abstraction on top of "raw" sockets. In particular, aSocket is responsible for allocating a new byte array for each incoming packet and passing it up to the user; this increases memory pressure as well as garbage collection overhead. However, it is more general in the sense that the user need not manage incoming buffer space for packets.

The NBIO servers exhibit lower performance for a small number of connections than their threaded counterparts. This is also not surprising: the overhead of using a small number of threads is low compared to setting up event-handling loops and using the NBIO SelectSet mechanism to test for incoming events. However, NBIO is not optimized for small numbers of connections -- if you only have a few connections, then you might as well be using threads! Sustaining high performance for a large number of connections is the goal.
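NBIO's SelectSet mechanism predates java.nio, but the event-handling loop described above has the same shape as a java.nio (JDK 1.4+) Selector loop, which I use here as a stand-in; the names (`EventLoop`, `serve`) are hypothetical. One thread multiplexes all connections instead of dedicating a thread to each.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Sketch of a single-threaded event loop in the style of SelectSet,
// written against java.nio's Selector as a stand-in for NBIO.
public class EventLoop {
    // Handles events until maxEvents have been processed; returns the count.
    static int serve(ServerSocketChannel server, int maxEvents) throws IOException {
        Selector sel = Selector.open();
        server.configureBlocking(false);
        server.register(sel, SelectionKey.OP_ACCEPT);
        int handled = 0;
        while (handled < maxEvents) {
            sel.select();                           // block until some socket is ready
            Iterator<SelectionKey> it = sel.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {           // new connection: register for reads
                    SocketChannel c = server.accept();
                    c.configureBlocking(false);
                    c.register(sel, SelectionKey.OP_READ);
                } else if (key.isReadable()) {      // data (or EOF) on a connection
                    SocketChannel c = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(8192);
                    if (c.read(buf) < 0) { key.cancel(); c.close(); }
                }
                handled++;
            }
        }
        sel.close();
        return handled;
    }
}
```

The fixed cost of setting up the selector and dispatch loop is what makes this approach lose to plain threads at small connection counts.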

The comparison between SelectSource and raw NBIO is meant to show that the overhead for event queue randomization is low.

Scaling to 10,000 Connections

This graph shows the aggregate bandwidth measured by the server as the number of connections grows from 0 to 10,000. Note that only the /dev/poll-based aSocket server was measured with 10,000 connections, although all of the NBIO-based servers could support the load.

At 10,000 connections the server obtains an aggregate bandwidth of 101 Mbps, which, while lower than the peak bandwidth of 161 Mbps (for 100 connections), is still very good. The performance of the poll(2)-based and threaded aSocket servers is shown for comparison. Note that the x-axis uses a log scale here.

Effect of Idle Connections on poll(2) Performance

It seems that when using poll(2), NBIO performance degrades as the number of connections grows; Linux's poll system call has known problems with a large number of sockets. To quantify this effect we reran the raw NBIO server with just 1 "active" connection, but many "idle" connections (which open a socket but send and receive no packets). We measure the bandwidth of just the one active connection as the number of idle connections grows.

The active client and the server were both connected using Gigabit Ethernet. Both the poll(2) and /dev/poll implementations of NBIO were run.

This graph clearly shows that for the poll(2)-based NBIO implementation, bandwidth falls off as the inverse of the number of idle connections. This suggests that the cost of testing for incoming events across all sockets grows linearly with the number of connections, which makes sense.
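The inverse relationship follows from a simple cost model. This is a back-of-the-envelope assumption of my own for illustration, not something fitted to the measured data: if each poll(2) call scans all n registered sockets at a fixed per-socket cost, the time to deliver one packet on the single active connection grows linearly in n, so its bandwidth falls off as ~1/n.

```java
// Toy cost model (assumption, not measured): delivering one packet on the
// active connection costs tIo seconds of real I/O work plus c seconds per
// registered socket scanned by poll(2), giving bandwidth ~ 1/n for large n.
public class PollModel {
    // Bandwidth (bits/sec) of the one active connection among n sockets.
    static double bandwidth(double packetBits, double tIo, double c, int n) {
        return packetBits / (tIo + c * n);
    }
}
```

Under this model, doubling the number of idle sockets roughly halves the active connection's bandwidth once c*n dominates tIo, which matches the shape of the curve.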

The /dev/poll NBIO implementation also shows some bandwidth degradation as the number of idle connections increases; however, this is much less than in the poll(2) case. The advantage of /dev/poll is that it presents a true "event queue" to the application: checking the queue for incoming events is a constant-time operation, not linear in the number of sockets. I believe that the Linux implementation of /dev/poll must still test across many sockets within the kernel, leading to the limited throughput degradation we see.

Back to SandStorm Index