13. Intermittent Network Multicast Loss

We have had to troubleshoot some tricky cases of intermittent multicast loss in network equipment while deploying the 29West LBM messaging product. Such problems can manifest themselves as spikes in latency while LBM repairs the loss (see Section 17.3).

Troubleshooting intermittent loss is a different type of problem than establishing basic multicast connectivity. Network hardware vendors such as Cisco provide troubleshooting advice for connectivity problems, but don't seem to offer much advice for troubleshooting intermittent problems. Notes on our experiences follow.

The root cause of intermittent multicast loss is generally a failing in a mechanism designed to limit the flow of multicast traffic so that it reaches only interested receivers. Examples of such mechanisms include IGMP snooping, CGMP, and PIM. Although they operate in different ways and at different layers, they all depend on receiving multicast interest information from receivers.

IGMP is used by multicast routers to maintain interest information for hosts on networks to which they are directly connected. (See the Wikipedia article on IGMP for more details.) PIM-SM or another multicast routing protocol must proxy interest information for receivers reachable only through other multicast routers. Such mechanisms use timers to clean up the interests of receivers that leave the network without explicitly revoking their interest. Problems with the settings of these timers can cause intermittent multicast loss. Similarly, interoperability problems between different implementations of these mechanisms can cause temporary lapses in multicast connectivity.

Simple tests can help to diagnose such problems and localize the cause. It's best to begin by adding multicast receivers at key points in the network topology. A receiver process added on the same machine as the multicast source can make sure that the source is at least attempting to send the data. It may also be helpful to add a simple hub or other network tap before the multicast source reaches the first switch in the network to confirm that packets aren't being lost before they leave the NIC. Logging multicast traffic with accurate time stamps and comparing logs across different monitoring points on the network can help isolate the network component that's causing the intermittent loss.

We used such techniques to identify an IGMP snooping interoperability problem between 3Com, Dell, and Cisco switches on a customer network. The customer chose to work around the problem by simply disabling IGMP snooping on all the switches. This was a reasonable choice since the receiver density across switch ports was fairly high and their multicast rates were quite low so that IGMP snooping wasn't adding much value on their network.

We've also seen a case where an operating system mysteriously stopped answering IGMP membership queries for interested processes running on the machine. It did continue to send an initial join message when a process started and a leave message when a process exited. An IGMP snooping switch heard the initial join message, started forwarding traffic, but later stopped forwarding it after a timeout had lapsed without hearing additional membership reports from the port. Rebooting the operating system fixed the problem.

Similar symptoms may appear if IGMP snooping switches are used without a multicast router or other source of IGMP membership query messages. Many modern switches that are capable of IGMP snooping can also act as an IGMP querier. Such a feature must be used if a LAN doesn't have a multicast router and you want the selective forwarding benefit of IGMP snooping. If IGMP snooping is enabled without a querier on the LAN, then intermittent multicast connectivity is likely to result.

Copyright 2004 - 2008 29West, Inc.