6. Monitoring Messaging Systems

Production deployments of messaging systems often rely on real-time monitoring and rapid human intervention when something goes wrong. An urgent response to problems detected by real-time monitoring may be required to keep the messaging system from becoming unstable. A common example is using real-time monitoring to discover crybaby receivers and repair or remove them from the network. (See Section 3.1 for details.) The presence of a crybaby receiver can starve lossless receivers of the bandwidth needed for new messages, a condition commonly called a NAK storm.

29West has designed LBM so that real-time monitoring and urgent response are not required. The design of LBM encourages stable operation by allowing you to pre-configure how LBM will use resources under all traffic and network conditions. Hence manual intervention is not required when those conditions occur.

Monitoring LBM still fills important roles other than maintaining stable operation. Chief among these are capacity planning and gaining a better understanding of the latency that LBM adds when recovering from loss. Collecting accumulated statistics from all sources and all receivers once per day is generally adequate for these purposes.
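
Because the statistics are accumulated, a once-per-day collection is typically turned into per-day figures by subtracting the previous day's totals. The following is a minimal sketch of that bookkeeping in Python. It does not call the LBM API; the ReceiverSnapshot class and its field names (msgs_received, msgs_lost, naks_sent) are hypothetical stand-ins for whatever counters a deployment actually collects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceiverSnapshot:
    """Accumulated counters read once per day (hypothetical field names)."""
    msgs_received: int  # total messages received since the receiver started
    msgs_lost: int      # total messages detected as lost since start
    naks_sent: int      # total NAKs sent since start


def _counter_delta(previous: int, current: int) -> int:
    # If the counter went backwards, assume the process restarted and the
    # counter began again from zero, so today's total is the whole delta.
    return current - previous if current >= previous else current


def daily_delta(previous: ReceiverSnapshot, current: ReceiverSnapshot) -> ReceiverSnapshot:
    """Convert two accumulated snapshots into the activity between them."""
    return ReceiverSnapshot(
        msgs_received=_counter_delta(previous.msgs_received, current.msgs_received),
        msgs_lost=_counter_delta(previous.msgs_lost, current.msgs_lost),
        naks_sent=_counter_delta(previous.naks_sent, current.naks_sent),
    )


def loss_ratio(delta: ReceiverSnapshot) -> float:
    """Fraction of the period's messages that were detected as lost."""
    total = delta.msgs_received + delta.msgs_lost
    return delta.msgs_lost / total if total else 0.0
```

Tracking the daily loss ratio and NAK count per receiver over several weeks gives a trend that can be used for capacity planning, without any need to react to individual readings in real time.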

Copyright 2004 - 2008 29West, Inc.