Messaging systems can be monitored to detect unstable operation. When trouble is detected, a human can be alerted to find the cause of the problem and remove it.
A classic example is a "crybaby receiver" causing a "NAK storm." A crybaby receiver is one that loses so many packets that it generates a flood of NAKs (Negative AcKnowledgments), each requesting retransmission. At a minimum, the resulting retransmissions add latency for receivers that are not losing data. In the worst case, retransmissions completely prevent even lossless receivers from receiving new messages.
Monitoring can detect a NAK storm and alert a human who can find and fix the crybaby receiver to allow data to once again flow to receivers without loss. We see this as an example of detecting unstable operation and relying on human response to restore stable operation.
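To make the monitoring approach concrete, here is a minimal sketch of how a monitor might flag a crybaby receiver by counting NAKs per receiver over a sliding time window. The class name, method name, window length, and threshold are all hypothetical choices for illustration; they are not part of any 29West API, and a real monitor would take its NAK counts from whatever statistics the messaging layer exposes.

```python
from collections import defaultdict, deque


class NakStormDetector:
    """Hypothetical monitor: flags receivers whose NAK count in a
    sliding time window exceeds a configured threshold."""

    def __init__(self, window_seconds=1.0, max_naks_per_window=100):
        self.window_seconds = window_seconds
        self.max_naks_per_window = max_naks_per_window
        # receiver id -> timestamps of that receiver's recent NAKs
        self._events = defaultdict(deque)

    def record_nak(self, receiver_id, timestamp):
        """Record one NAK and return True if this receiver now
        exceeds the threshold (i.e. looks like a crybaby)."""
        events = self._events[receiver_id]
        events.append(timestamp)
        # Drop NAKs that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) > self.max_naks_per_window
```

Note that a detector like this only raises the alarm. Someone still has to respond to it, which is exactly the dependence on human intervention described above.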
We believe it's better to simply establish policies that prevent the system from reaching the point of unstable operation. Automatic policies should be in place that prevent sources from retransmitting so fast that receivers stop receiving new messages. With policies like this in place, human intervention is not required to maintain stable operation.
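One common way to express such a policy is a token bucket that caps the bandwidth a source may spend on retransmissions. The sketch below is an illustration under stated assumptions, not a description of how LBM or UME implement their limits: the class, its parameters, and the choice of a token bucket are all hypothetical.

```python
class RetransmitLimiter:
    """Hypothetical token-bucket policy limiting retransmission bandwidth.

    Tokens are bytes. The bucket refills at rate_bytes_per_sec and holds
    at most burst_bytes, so retransmissions can never exceed that
    long-run rate no matter how many NAKs arrive.
    """

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = float(rate_bytes_per_sec)
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = None                  # time of the previous call

    def allow(self, size_bytes, now):
        """Return True if a retransmission of size_bytes may be sent now."""
        if self.last is not None:
            # Refill for the time elapsed since the previous call.
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        # Over budget: skip this retransmission so new data keeps flowing.
        return False
```

With a limiter like this in front of the retransmission path, a crybaby receiver can exhaust only the retransmission budget. The bandwidth reserved for new messages is untouched, so the other receivers keep up without anyone being paged.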
Such automatic policies do not remove the need for monitoring, but they do reduce the need for prompt human attention and intervention. They effectively shift the focus of monitoring toward capacity planning and forensics. See Section 2 above.
The LBM and UME products from 29West allow policies to be set that limit the rate of retransmission so that the impact of crybaby receivers can be limited. Such policies allow stable operation of your applications even when some receivers are suffering massive losses.
Copyright 2004 - 2007 29West, Inc.