5. Kernel/User Transitions

In modern operating systems, network packets are transferred to and from the hardware interface by the operating system kernel. Since applications typically run in user space, a transition between user and kernel space is necessary to get message data into and out of the machine. The overhead involved in making this transition can be significant, introducing both message latency and CPU load. Designers of high-performance applications generally try to minimize the number of user/kernel transitions needed to get a message from a sending application to a receiving application.

One fairly easy way to minimize user/kernel transitions is to avoid making multiple system calls per message. For example, an application message may contain an application header followed by a data payload. It may be convenient for the programmer to make two system calls to send the message: one to send the header and another to send the payload. This is especially true if different parts of the code are responsible for header and data generation. Similarly, the receiving application may find it useful to first read the header, perhaps to get the length of the rest of the message, and then do a second read of the proper number of bytes. However, for high-performance applications, the header and payload should be combined by the application and given to the kernel with a single system call. The receiving application should create a receive buffer large enough for the largest possible message and do a single read. These steps improve both CPU and network efficiency, leading to greater overall throughput.
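As a minimal sketch of the single-call approach, the Python example below packs a header and payload into one buffer, hands it to the kernel with one send, and reads it back with one receive. The header layout (a 4-byte payload length and a 2-byte message type), the MAX_MESSAGE limit, and the function names are assumptions made for illustration. The sketch uses a Unix-domain datagram socket pair so that one send corresponds to exactly one receive; on a stream socket such as TCP, a single read may return only part of a message.

```python
import socket
import struct

# Hypothetical header layout: 4-byte payload length, 2-byte message type.
HEADER = struct.Struct("!IH")
# Assumed upper bound on header plus payload for this illustration.
MAX_MESSAGE = 65536


def send_message(sock, msg_type, payload):
    """Combine header and payload in user space, then send once."""
    buf = HEADER.pack(len(payload), msg_type) + payload
    sock.send(buf)  # one system call for the whole message


def recv_message(sock):
    """Read the whole message with one call into a worst-case-sized buffer."""
    buf = sock.recv(MAX_MESSAGE)  # one system call for the whole message
    length, msg_type = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    return msg_type, payload


def demo():
    """Send one message across a local datagram socket pair (Unix-like systems)."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        send_message(sender, 7, b"hello")
        return recv_message(receiver)
    finally:
        sender.close()
        receiver.close()
```

When the header and payload are produced by different parts of the code, a scatter/gather call such as Python's socket.sendmsg, which accepts a list of buffers on Unix-like systems, is another way to keep the send to one system call without first copying the pieces together.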

When senders and receivers are on different machines, a given message must cross the user/kernel boundary a minimum of two times: once on the sending machine and once on the receiving machine. Many high-performance applications seek to further improve efficiency by batching, that is, combining multiple application messages and sending them to the kernel as a single buffer. Because the whole batch shares one crossing on each side, this can reduce the average number of user/kernel transitions per message to less than one. However, for latency-sensitive applications, message batching is often undesirable since it can introduce more latency than the user/kernel transitions it eliminates (see Section 2 for exceptions).
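The batching idea can be sketched as follows. This is an illustration only, assuming a simple framing in which each message is preceded by a 4-byte length; the function names are hypothetical and do not come from any particular messaging product. The last function shows the arithmetic behind the "less than one" figure: if a whole batch costs two boundary crossings, the per-message average is two divided by the batch size.

```python
import struct

# Assumed framing: each message is prefixed with its 4-byte length.
LENGTH = struct.Struct("!I")


def pack_batch(messages):
    """Combine several application messages into one buffer for a single send."""
    parts = []
    for message in messages:
        parts.append(LENGTH.pack(len(message)))
        parts.append(message)
    return b"".join(parts)


def unpack_batch(buf):
    """Split a received buffer back into the individual application messages."""
    messages = []
    offset = 0
    while offset < len(buf):
        (length,) = LENGTH.unpack_from(buf, offset)
        offset += LENGTH.size
        messages.append(buf[offset:offset + length])
        offset += length
    return messages


def transitions_per_message(batch_size, transitions_per_buffer=2):
    """Average user/kernel transitions per message when a batch shares them."""
    return transitions_per_buffer / batch_size
```

With this accounting, any batch of more than two messages averages fewer than one transition per message; a batch of four averages 0.5.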

Server/daemon-based messaging systems introduce additional user/kernel transitions. For example, TIBCO® RV requires that a sending application send a message first to a daemon process on the same machine, then to a corresponding daemon on the receiving machine, before it is sent to the receiving application. If batching is suppressed, this leads to a minimum of six user/kernel transitions per message. TIBCO SmartSockets, as well as most JMS-based systems, don't use daemon processes, but do use a central server, leading to a minimum of four user/kernel transitions per message.
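One way to read the counts above is that the sender's write and the receiver's read account for two transitions, and every intermediate user-space process (daemon or server) adds two more: one to read the message out of the kernel and one to write it back. The short sketch below encodes that reading; the function name and the formula are an interpretation of the figures in the text, not a vendor-supplied rule.

```python
def min_transitions(intermediaries):
    """Minimum user/kernel transitions per unbatched message.

    Sender write + receiver read = 2; each intermediate process
    (daemon or central server) reads and re-sends the message = 2 more.
    """
    return 2 * (intermediaries + 1)


DIRECT = min_transitions(0)          # application to application
CENTRAL_SERVER = min_transitions(1)  # one server between the applications
DAEMON_PAIR = min_transitions(2)     # one daemon on each machine
```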

When using messaging systems that rely on daemons and/or servers, it may be useful to utilize some form of batching and experimentally determine what level of batching provides the smallest average latency.
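A possible shape for such an experiment is sketched below. It assumes the caller supplies a measure_latencies function (hypothetical, not part of any messaging API) that runs a test at a given batch size and returns the observed per-message latencies; the sketch only averages those samples and reports the batch size with the smallest mean.

```python
from statistics import mean


def pick_batch_size(measure_latencies, candidates):
    """Return (best_size, averages) for the candidate batch sizes.

    measure_latencies(size) is a caller-supplied, hypothetical function that
    runs the workload at that batch size and returns latency samples.
    """
    averages = {size: mean(measure_latencies(size)) for size in candidates}
    best = min(averages, key=averages.get)
    return best, averages
```

In practice the measurement should be run under a realistic message rate, since the batch size that gives the smallest average latency is likely to depend on load.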

The 29West LBM product buffers entire application messages to minimize the number of operating system write calls. In addition, it does not use daemons or central servers; messages are sent directly from the sending application to the receiving application. These measures lead to the best case of two user/kernel transitions per message when batching is turned off. (For designers who need to emphasize throughput and efficiency over latency, batching can be enabled to further reduce the user/kernel transitions, often to less than one per message.)

Copyright 2005 - 2006 29West, Inc. -- 29West Confidential