MONARC-99/?? - version 2.2

Notes on the Objectivity/DB AMS protocol

Koen Holtman, CERN/CMS, 24 June 1999

This document compiles some observations on the Objectivity/DB AMS protocol. It is based on Objectivity 4.0.2/Solaris tests I did in February 1999; the part about large VArrays over the AMS is based on tests done in June 1999 using Objectivity 5.1/Solaris. Thanks to Eva Arderiu Ribera for information on large VArray tests over the AMS done earlier by Marcin Nowak.
Note that this document does not cover the FTO/DRO option: I tested remote access to unreplicated databases, and FTO/DRO may introduce additional protocol complications.

Protocol

Communication between the client and the AMS happens with an RPC protocol over a TCP/IP connection. By default this protocol has a timeout of 25 seconds; this default may be a problem in some cases (a slow or unreliable network, or an HPSS interface).

As TCP/IP is the base AMS transport mechanism, all the usual issues with TCP/IP apply to AMS communications. Some machines may have kernel limitations on the number of TCP/IP connections. Also, on high-latency links, the relatively small TCP window size may cause a bottleneck even though the available bandwidth has not been saturated. In that case multiple remote clients, each having their own connection to the AMS, would be needed to saturate the link. Feedback mechanisms inside TCP/IP, which ensure that any one connection takes only its fair share of the available bandwidth, may cause nonlinear behaviour in some cases. On performance issues related to TCP/IP, see the usual literature.
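
As an illustration of the window-size limit: a single TCP connection can have at most one window of unacknowledged data in flight, so its throughput is bounded by the window size divided by the round trip time, no matter how wide the link is. The sketch below (in Python) computes this ceiling; the window size and round trip time in the example are arbitrary illustrative values, not measurements.

    # Throughput ceiling of a single TCP connection: at most one window of
    # unacknowledged data can be in flight, so throughput is bounded by
    # window_size / round_trip_time, regardless of the link bandwidth.
    def tcp_throughput_ceiling(window_bytes, rtt_seconds):
        return window_bytes / rtt_seconds

    # Illustrative (not measured) values: a 32 KB window on a 0.1 s round
    # trip limits one connection to about 320 KB/s.
    print(tcp_throughput_ceiling(32 * 1024, 0.1) / 1024.0, "KB/s")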

Local vs. remote access

Objectivity clients access pages in a remote database in the same way as they would access them if the database were local. The order of page reads and writes depends on the program being run, and does not change when the database location changes.

For reading and writing of large (>64 KB) VArray contents, there is a difference between local and remote databases. With a local database, the large VArray contents are read and written in a single filesystem I/O operation, no matter what the VArray size is. When the client reads or writes the contents of a large VArray through the AMS, this happens with a sequence of AMS communications: the client divides the VArray into 64 KB chunks and does one AMS communication for each chunk. This 64 KB chunk size is independent of the federation page size (which ranges from 2 to 64 KB). If 64 KB is not divisible by the chosen page size, the chunk size may actually be smaller, so that it remains a multiple of the page size; I did not check this.
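
A minimal sketch (in Python) of the chunking behaviour described above: it computes how many AMS communications a large VArray transfer would take. The rounding of the chunk size down to a multiple of the page size is my assumption and is unverified.

    # Model of a large-VArray transfer over the AMS: the contents are sent
    # in fixed-size chunks, one AMS communication per chunk.
    def ams_varray_chunks(varray_bytes, page_bytes):
        chunk = 64 * 1024
        # Assumption (unverified): if 64 KB is not a multiple of the page
        # size, the chunk size is rounded down to a multiple of it.
        if chunk % page_bytes != 0:
            chunk = (chunk // page_bytes) * page_bytes
        n_chunks = -(-varray_bytes // chunk)   # ceiling division
        return n_chunks, chunk

    # Example: a 1 MB VArray with a 32 KB page size takes 16 AMS
    # communications of 64 KB each.
    print(ams_varray_chunks(1024 * 1024, 32 * 1024))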

Objectivity clients always bring the whole VArray into memory when the first element is accessed, no matter whether the array is in a local or a remote database.

Writing through the AMS

To write a page (or a part of a large VArray), the client sends a request message, which includes the page contents (or VArray chunk), to the AMS server. This message is not acknowledged. Writing thus seems to be completely asynchronous, and should therefore not suffer extra delays when the round trip time between client and AMS grows.
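
Since the client never waits for an acknowledgement, the total write time should, to first order, be bandwidth-limited rather than round-trip-limited. The following is a simple model of this, not a measurement; the numbers plugged in are illustrative only.

    # Simple model of writing through the AMS: write messages are not
    # acknowledged, so the client never waits a round trip per page; the
    # total time is roughly the transfer time plus one one-way delay for
    # the last message to arrive.
    def ams_write_time(n_pages, page_bytes, bandwidth_bytes_per_s, rtt_s):
        return n_pages * page_bytes / bandwidth_bytes_per_s + rtt_s / 2.0

    # Illustrative numbers: 1000 pages of 32 KB over a 1 MB/s link with a
    # 0.25 s round trip -> about 31 s.
    print(ams_write_time(1000, 32 * 1024, 1024 * 1024, 0.25), "s")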

Reading through the AMS

To read a page (or a part of a large VArray), the client sends a small request message (containing, in some form, the location to be read) to the AMS server, and then waits for a response message from the AMS server, which contains the page (or VArray part). Only after this response has been received does the client continue with further processing, and this further processing may lead to a new page read request being sent.

I have never seen a client request two or more pages at once: it always requests a single page and waits for it to arrive. This means that reading is synchronous and non-pipelined. One can thus expect extra delays when the round trip time between the client and the AMS grows.
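
A simple model of this effect, under illustrative assumptions (it ignores TCP behaviour and any server-side latency): each page costs a full round trip plus its transfer time.

    # Simple model of reading through the AMS: one synchronous request per
    # page, so each page costs a round trip plus its transfer time.
    def ams_read_time(n_pages, page_bytes, bandwidth_bytes_per_s, rtt_s):
        return n_pages * (rtt_s + page_bytes / bandwidth_bytes_per_s)

    # Same illustrative numbers as the write model above: 1000 pages of
    # 32 KB over a 1 MB/s link with a 0.25 s round trip -> about 281 s,
    # versus about 31 s for writing, because of the per-page round trip.
    print(ams_read_time(1000, 32 * 1024, 1024 * 1024, 0.25), "s")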

Wide-area reading test

I did a small wide-area test of sequential container reading, with the database file at Caltech and the client at CERN. The page size was 32 KB. Lockserver and federation catalog were at CERN.

At the time (of day) of the test, the CERN-Caltech round trip time was 0.25 seconds. Theoretically, looking at the AMS reading protocol and assuming no extra round-trip overheads due to TCP, reading a 32 KB page will take at least 0.25 seconds, so a single client can never pull more than 32/0.25 = 128 KB/s, regardless of the bandwidth of the link.
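
The small calculation below (in the same spirit as the read model above) reproduces this ceiling; it ignores TCP effects and the transfer time of the page itself, so the real ceiling is somewhat lower.

    # Upper bound on single-client AMS read throughput: at best one page
    # per round trip, so throughput <= page_size / rtt.
    page_size = 32 * 1024   # 32 KB federation page size used in the test
    rtt = 0.25              # CERN-Caltech round trip time at test time (s)
    print(page_size / rtt / 1024.0, "KB/s")   # -> 128.0 KB/s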

In the small test I could pull ~50 KB/s with an Objectivity client reading from a DB file at Caltech. With an ftp client I could pull the file over at ~60 KB/s.

Not multithreaded

It should be noted that the current (tested) version of the AMS is not actually multithreaded. A single machine can only run a single AMS process. If the network throughput and filesystem throughput of this machine are high, the bottleneck may be the CPU usage of the AMS process. The single-threaded AMS can saturate no more than the equivalent of a single CPU, even on a multi-CPU machine.

Note that it is unknown to me how much CPU power an AMS server actually needs to serve a certain amount of data. I once saw a machine use 100% of its single CPU providing AMS services, but I expect that a very large part of the CPU consumption went to the TCP/IP protocol stack and filesystem code inside the kernel, rather than to the user-level AMS code itself.

The lack of multithreading means that with the current AMS system, to scale up AMS traffic, one should install an additional (small) machine with an additional (small) filesystem, rather than double the CPU power and filesystem throughput of the existing machine. The AMS processes on different machines do not have to communicate with each other, so I expect linear scaling as long as the network link to these machines is not a bottleneck. Note again that this is the non-FTO/DRO, non-replicated case.

Of course, many small AMS data servers with many small filesystems are more difficult to manage than a few large servers.

Speculation on the BaBar multithreaded solution

People in the BaBar experiment at SLAC are working with Objectivity to overcome the single-threadedness limitation of the AMS. It is unclear to me at this point what the final solution will look like exactly. From what I have been able to determine, the multi-threaded solution will not change the page-level (and VArray chunk-level) AMS read/write protocol described above.

The BaBar solution is likely to allow for multiple AMS processes (or threads) running on a single machine, all providing access to databases on the filesystem(s) of this machine. Each client will connect permanently to one of these AMS processes (or threads) after a negotiation phase which ensures some degree of load balancing between the AMS instances. The multithreaded solution will allow all CPU capacity in the machine to be used for AMS tasks. No fine-grained synchronisation between the different AMS processes or threads running on a machine will be needed, so I expect linear performance scaling if more CPUs and AMS processes are added to the machine, until network or filesystem throughput limits provide a ceiling.

Lockserver protocol

Like the AMS reading protocol, the client-lockserver protocol is also synchronous, non-pipelined RPC over TCP/IP.