Simulator Implementation



next up previous
Next: Simulator Parameters Up: Experimentation Previous: Simulator Design Goals

Simulator Implementation

Our simulator is divided into three parts; the simulator engine that drove the simulation, the host modules that implemented the caching schemes, and the network module that simulated the Internet.

The engine read in the trace data, and used it to create events. Each event represents a host requesting a World Web document from a server; the actual event consists of the server and host involved, the name of the document, the size of the document, and the date and time of the transaction.

The engine also provides support for periodic jobs that must be run on a frequent basis. When the job is scheduled to run it is removed from the queue and the associated code is run. The code may, as part of its duties, re-insert the job on the queue. This allows us, for example, to reset a host's load every hour, and to snapshot the simulator every six.

There were two different types of host modules: servers and clients. Clients would be handed an event by the engine and would use this information to request a URL from a server. This request would travel through our simulated Internet where appropriate bandwidth would be logged. Servers had all the functionality of clients, but could also serve files. Servers kept track of which files they were caching; since we were most interested in simulating the server's behaviour our simulation also maintained a primary server that always stored the most up-to-date versions of files. Both types of hosts record a variety of load parameters such as number of requests, number of bytes stored, number of bytes sent, cache hits, and stale cache hits.

Depending on the exact scheme we were simulating, the server could either return the request document or redirect the client to another server. Clients could cache not only redirects but also files themselves in order to simulate client-caching. Servers could also push documents onto other servers; a centralized manager kept track of servers willing to accept documents.

The Internet module was the most difficult portion of the simulation to build. We wanted it to log not only the number of packets passing across it, but the exact bandwidth used in terms of , where 1000 bytes sent across 10 network hops would use twice the bandwidth of 1000 bytes sent across 5 network hops. In order to log bandwidth accurately the simulator needed an accurate Internet topology, but it would clearly be impossible to calculate and store the entire Internet topology. Even if we limited the amount of information we needed to the 60,000 or so hosts mentioned in the NCSA access log we would still be faced with an impossible amount of information.

We also did not want to synthesize topology information because push-caching would be very sensitive to the interplay between geographic information and Internet topology. If we used the same assumptions to generate the topology that push-caching used to approximate it we would see artificially high correlations between predicted performance and actual performance, a clear case of simulator bias.

We were fortunate therefore to be given access to the Internet topology gathered for a Network Time Protocol server survey by Guyton and Schwartz [18] and discussed in section gif. Using this topology we were able to accurately measure the number of hops between two arbitrary hosts, and thereby compare the efficiency of using miles to predict topology versus using the actual topology itself. We were unable, however, to accurately measure the number of backbone hops between two hosts or to calculate the latency between two hosts using this data. These are factors that we can only crudely approximate using the results of section gif.

Simulated Internet Topology

 

Of the 6700 hosts in Guyton and Schwartz's database, we were left with 1700 hosts after removing network routers and hosts on networks that span multiple zip-codes. Figure gif displays the hosts graphically, plotting each host's longitude and latitude. It is easy to see the distinctive shape of the United States, as well as the presence of hosts in Alaska and Hawaii.

  
Figure: Geographical Host Locations. Each potential client in our database is represented by a dot, plotted using the longitude and latitude information gathered for each host. Note the distinctive shape of the United States, as well as the presence of clients in Hawaii and Alaska.

Since our results rely so heavily on this data we took several steps to assure its authenticity. A host from Harvard was conveniently included in the study; we calculated the number of network hops from this host on the simulated Internet to all the other hosts on the simulated Internet. We also calculated the number of network hops from a real host at Harvard to all the other hosts on the real Internet. The results are shown in Figure gif.

  
Figure: Simulated Internet vs. Real Internet. Here we have plotted network hops to the simulated hosts against networks hops to the real hosts. The regression fit correlation is , the fitted line's equation is .

The relatively high correlation indicates that the data are similar; the offset on the fitted line indicates that the simulated network hops are lower overall than the real network hops. The differences in the simulated and real networks stem from three causes. First, the simulated data were gathered over a year ago and many of the hosts from the original survey are no longer accessible. Out of the 1700 simulated hosts we were unable to contact 125 real hosts. Secondly, we did not have access to the actual machine at Harvard present in the simulated topology. We had to settle for running our traces from a Harvard machine that is one hop farther away from the backbone than the simulated machine. Thirdly, the real Internet does not always perform optimal routing: the path that a packet takes between two hosts is not always the shortest path. We can not replicate this routing exactly in our simulated Internet-our packets always traveled the shortest route. These differences are non-trivial, but will not skew our results since all of our caching schemes will use the same topology information.

Since our simulated Internet contains only 1700 hosts and our traces contain up to 50,000 hosts, we needed to create a mapping from real hosts to simulated hosts. Requests in a simulated run appear to be initiated from the mapped host, not from the original host. In computing this mapping we therefore try to maintain geographical information as much as possible. We first try to map a host onto a simulated host in the same subnetwork of the real host. For example, any requests from hosts on the 128.103 subnet will be mapped to a simulated host also on the 123.103 subnet if possible. If such a mapping is not possible then we try to map the host onto a randomly selected host in the same state. This is possible for all states except Vermont or Wyoming, because no hosts in the topology were located in either of those two states. For requests from those states we choose a mapping completely at random.

We did not initially maintain host mappings between simulation runs. Several of our early results showed excessive noise, however, and we realized that this randomization in the host mapping was skewing our results. We therefore set up the mapping such that once a host is mapped it remains mapped for all simulations. This was a valuable lesson because it showed us the sensitivity with which host-mapping affects network bandwidth consumption. We expect therefore that any parameter that affects server selection will affect bandwidth consumption.

Implementation Details

The simulator was written in C++ to simplify the implementation of its modules. Each module was written as a C++ class, and we took advantage of inheritance to simplify the substitution of cache-scheme dependent operations. As an example, the network always hands its messages off to a CHost object. Since the server and client classes inherit their interface from the host class, different types of hosts can be substituted into the network without having to change any network code.

This approach is also efficient because scheme-independent code need only be written once in the generic module, and then inherited modules can call it. This applies, for example, to the instrumentation built into all the host classes to monitor load, etc.

To reduce the simulator's memory requirements we used a disk-backed database to store information such as which clients are caching which documents, the simulated Internet topology, and the host mapping.

We have also built into the simulator extensive logging tools that were used for both debugging as well as gathering results. The debug logs enabled us to test the operation of the simulator by running with contrived test cases in order to verify push-cache operation.



next up previous
Next: Simulator Parameters Up: Experimentation Previous: Simulator Design Goals



James Gwertzman
Wed Apr 12 00:26:11 EDT 1995