It is quite likely that members of a multicast group will experience different loss patterns. Group members who experienced loss will be interested in retransmission of the lost data while members who have already received it will not. In many ways, this presents conflicts similar to those involved in choosing the best sending rate for the group (see Section 3). The conflicts are particularly pronounced when loss is experienced by one or a small group of receivers (see Section 3.1).
It's important to note how small group loss differs from the case where packet loss is experienced by many receivers in the group. When loss is widespread, the resulting multicast retransmissions benefit many receivers. When loss is isolated to one or a few receivers, those receivers benefit from the retransmissions, but the other receivers may experience added latency while the sender retransmits. In the common case where network bandwidth between the sender and receivers is limited, the bandwidth needed for retransmissions must be subtracted from that available for new data transmission. For example, the 29West LBT-RM reliable multicast protocol always prioritizes retransmissions ahead of new data, so retransmissions add a small amount of latency whenever there is new data waiting to be sent to all members of the group.
Reliable delivery to a multicast group in the face of loss involves a trade-off between throughput/latency and reliable reception for all members. Ideally, administrative policy should establish the boundaries within which reliability will be maintained. A simple and effective way to establish a boundary is to limit the amount of bandwidth that will be used for retransmissions. Establishing such a boundary can effectively defend a group of receivers against an "attack" from a crybaby receiver. No matter how much loss is experienced across the receiver set, the sender will keep the retransmission rate within the boundary set by administrative policy.
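One common way to enforce such a boundary is a token bucket on the sender. The following is a minimal sketch under stated assumptions, not a description of how LBT-RM or any other product implements its limit. The class name RetransmitBudget, its parameters, and the injectable clock are all hypothetical names introduced for illustration.

```python
import time


class RetransmitBudget:
    """Token bucket capping the bytes per second spent on retransmissions.

    Hypothetical illustration only; not taken from any product's API.
    """

    def __init__(self, rate_bytes_per_sec, burst_bytes, clock=time.monotonic):
        self.rate = float(rate_bytes_per_sec)   # long-term retransmission ceiling
        self.capacity = float(burst_bytes)      # largest burst allowed at once
        self.tokens = float(burst_bytes)        # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_send(self, nbytes):
        """Return True if a retransmission of nbytes fits within the budget."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over budget: the caller drops or defers this retransmission
```

A sender would call try_send() before servicing each retransmission request and skip the retransmission when it returns False. New data never draws on this budget, so the share of bandwidth consumed by retransmissions stays bounded no matter how many requests arrive.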
Note that limiting the retransmission request rate on receivers might be better than doing nothing, but it's not as effective as limiting the bandwidth available for retransmission. For example, if a large number of receivers experience loss, then the combined retransmission request rate could be unacceptably high, even if each individual receiver limits its own retransmission request rate.
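The arithmetic behind this point can be sketched briefly. The numbers and function names below are hypothetical and chosen only to illustrate the scaling; they are not measurements or defaults from any product.

```python
def aggregate_request_rate(lossy_receivers, per_receiver_limit):
    """Worst-case requests/sec reaching the sender when each receiver is capped."""
    return lossy_receivers * per_receiver_limit


def serviced_rate(requested, sender_cap):
    """Retransmissions/sec actually sent when the sender enforces its own cap."""
    return min(requested, sender_cap)


# Hypothetical values, for illustration only.
per_receiver_limit = 100   # requests/sec each receiver may issue
sender_cap = 500           # retransmissions/sec the sender allows in total

for lossy in (1, 10, 200):
    requested = aggregate_request_rate(lossy, per_receiver_limit)
    print(lossy, requested, serviced_rate(requested, sender_cap))
```

With a receiver-side limit alone, the load on the sender grows linearly with the number of receivers experiencing loss. With a sender-side cap, the load stays flat once the cap is reached.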
Even if retransmission rates have been limited, it is still important to identify the cause of isolated receiver loss problems and repair them. Usually, such loss is caused by overrunning NIC buffers or UDP socket buffers. Assuming that the sender cannot be slowed down, receiver loss can generally be avoided by one or more of these means:
Increasing NIC ring buffer size (e.g. use a brand-name, server-class NIC instead of a generic, workstation-class NIC; see also Section 18.2)
Decreasing the OS NIC interrupt service latency (e.g. decrease CPU workload or add more CPUs to a multi-CPU machine)
Increasing UDP socket buffer size (see Section 8.8)
Decreasing the OS process context switching time (e.g. decrease CPU workload or add more CPUs to a multi-CPU machine)
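As an illustration of the UDP socket buffer item above, the sketch below requests a larger receive buffer through the standard SO_RCVBUF socket option and reads back what the OS actually granted. The function name open_udp_receiver and the 8 MB request are hypothetical choices for this example. The OS may silently cap the request at a system-wide maximum (on Linux, net.core.rmem_max), and Linux reports a value double the one it accepted, so the read-back matters.

```python
import socket


def open_udp_receiver(requested_rcvbuf_bytes):
    """Create a UDP socket and ask the OS for a larger receive buffer.

    Returns the socket and the buffer size the OS reports after the request.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf_bytes)
    # Read the value back: the OS may have capped (or, on Linux, doubled) it.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


if __name__ == "__main__":
    requested = 8 * 1024 * 1024  # hypothetical 8 MB request
    sock, granted = open_udp_receiver(requested)
    print(f"requested {requested} bytes, OS reports {granted} bytes")
    sock.close()
```

If the reported size is well below the request, the system-wide limit usually needs to be raised by an administrator before the application setting can take effect.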
Establishing a bandwidth limit on retransmissions will not help an isolated receiver experiencing loss, but it can be a critical factor in ensuring that one receiver does not take down the whole group with excessive retransmission requests. Retransmission rate limits are likely to increase the number of unrecoverable losses on receivers experiencing loss. Still, it's generally best for the group to first establish a defense against future crybaby receivers before working to fix any individual receiver problems.
Copyright 2004 - 2008 29West, Inc.