Stale TCP connections with tg3 on Red Hat Enterprise Linux 6

Updated April 2, 2014 at 00:28

Issue

  • Stale TCP connections
  • Connections are being reset unexpectedly
  • SSH sessions disconnect just after login
  • Page allocation failure messages seen in /var/log/messages:
Jan  9 06:36:35 xxxxxxx kernel: swapper: page allocation failure. order:4, mode:0x20
Jan  9 06:36:35 xxxxxxx kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.14.1.el6.x86_64 #1

Environment

  • Red Hat Enterprise Linux 6
    • Broadcom tg3 NIC driver version 3.132 or earlier (the 3.132 tg3 driver is included in RHEL 6 Update 5)
    • GSO/TSO enabled

Resolution

  • This issue is being tracked in private Bugzilla #1029192 and does not currently have a resolution. If you encounter this issue, please contact Red Hat Support.

Workaround

  • Disable TSO/GSO on all interfaces using the tg3 NIC driver:
# ethtool -K <interface> gso off tso off
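
On a host with several NICs, it can help to find out first which interfaces are actually bound to tg3. The following is a minimal sketch, assuming the driver binding is visible as the sysfs symlink /sys/class/net/<interface>/device/driver. The function name tg3_interfaces and its optional directory argument are hypothetical helpers for illustration. The script only prints the ethtool commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# List network interfaces bound to the tg3 driver by reading sysfs.
# The optional first argument (default /sys/class/net) is a hypothetical
# hook that lets the function be pointed at another directory.
tg3_interfaces() {
    net_dir="${1:-/sys/class/net}"
    for dev in "$net_dir"/*; do
        # Physical NICs expose their driver as a symlink at device/driver;
        # virtual devices such as lo or bond0 have no such link.
        [ -L "$dev/device/driver" ] || continue
        driver=$(basename "$(readlink "$dev/device/driver")")
        if [ "$driver" = "tg3" ]; then
            basename "$dev"
        fi
    done
}

# Print the workaround command for each tg3 interface instead of running it.
for iface in $(tg3_interfaces); do
    echo "ethtool -K $iface gso off tso off"
done
```

The current offload state of an interface can be checked afterwards with ethtool -k <interface> (lowercase k).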

Root Cause

  • The tg3 driver has a workaround for a hardware issue on some models that may cause some packets to be reallocated during TX. The problem occurs when this happens to a GSO/TSO offloaded packet (>= MTU size) while system memory is heavily fragmented. Because the reallocation happens in softirq context, the allocation has to be atomic and cannot sleep while waiting for memory to become available. Under these conditions the allocation may fail, and the driver is then unable to pass the large packet down to the network card.
  • If the allocation fails, the packet is silently dropped, just as if it had been lost on the network. When TCP retransmits, it is likely to hit the same allocation failure, which stalls the connection.
  • Disabling GSO/TSO keeps such large packets out of the tg3 TX path and should alleviate the problem.
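
Since the failure depends on memory fragmentation, a rough view of how many large contiguous blocks remain free can be useful context. The sketch below is not part of the original diagnosis. It assumes the usual /proc/buddyinfo layout, where each line reads "Node N, zone NAME" followed by free-block counts for order 0, 1, 2 and so on. The function name high_order_free is hypothetical, and the default threshold of 3 is taken from the order:3 failure shown in the logs.

```shell
#!/bin/sh
# For each memory zone, sum the free blocks of order >= min_order.
# Reads buddyinfo-format text on stdin.
high_order_free() {
    min_order="${1:-3}"
    awk -v min="$min_order" '
        $1 == "Node" && $3 == "zone" {
            total = 0
            # Field 5 is the order-0 count, so order k is field 5 + k.
            for (i = 5 + min; i <= NF; i++) total += $i
            printf "%s %s %s: %d free blocks of order >= %d\n", $1, $2, $4, total, min
        }'
}

if [ -r /proc/buddyinfo ]; then
    high_order_free 3 < /proc/buddyinfo
fi
```

Counts at or near zero in the zone serving the allocation would be consistent with the fragmentation described above, but they are a snapshot only and do not prove that a given drop was caused by it.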

Diagnostic Steps

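  • Check the dmesg output or /var/log/messages for a page allocation failure whose call trace includes skb_copy and tg3_start_xmit (marked with <--- below):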
Nov  8 07:59:25 SERVER kernel: swapper: page allocation failure. order:3, mode:0x20
Nov  8 07:59:25 SERVER kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.18.1.el6.x86_64 #1
Nov  8 07:59:25 SERVER kernel: Call Trace:
Nov  8 07:59:25 SERVER kernel: <IRQ>  [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
Nov  8 07:59:25 SERVER kernel: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
Nov  8 07:59:25 SERVER kernel: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
Nov  8 07:59:25 SERVER kernel: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
Nov  8 07:59:25 SERVER kernel: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
Nov  8 07:59:25 SERVER kernel: [<ffffffff811688f0>] ? kmem_cache_alloc_node_trace+0x90/0x200
Nov  8 07:59:25 SERVER kernel: [<ffffffff81168b0d>] ? __kmalloc_node+0x4d/0x60
Nov  8 07:59:25 SERVER kernel: [<ffffffff8143ddad>] ? __alloc_skb+0x6d/0x190
Nov  8 07:59:25 SERVER kernel: [<ffffffff8143eed0>] ? skb_copy+0x40/0xb0                            <---
Nov  8 07:59:25 SERVER kernel: [<ffffffffa012627c>] ? tg3_start_xmit+0xa8c/0xd50 [tg3]              <---
Nov  8 07:59:25 SERVER kernel: [<ffffffff814493b8>] ? dev_hard_start_xmit+0x308/0x530
Nov  8 07:59:25 SERVER kernel: [<ffffffffa01d2d5b>] ? bond_start_xmit+0x53b/0x5d0 [bonding]
Nov  8 07:59:25 SERVER kernel: [<ffffffff8146773a>] ? sch_direct_xmit+0x15a/0x1c0
Nov  8 07:59:25 SERVER kernel: [<ffffffff814493b8>] ? dev_hard_start_xmit+0x308/0x530
Nov  8 07:59:25 SERVER kernel: [<ffffffff8144d0c0>] ? dev_queue_xmit+0x3b0/0x550
Nov  8 07:59:25 SERVER kernel: [<ffffffffa01d2687>] ? bond_dev_queue_xmit+0x67/0x200 [bonding]
Nov  8 07:59:25 SERVER kernel: [<ffffffffa01d2d5b>] ? bond_start_xmit+0x53b/0x5d0 [bonding]
Nov  8 07:59:25 SERVER kernel: [<ffffffff814493b8>] ? dev_hard_start_xmit+0x308/0x530
Nov  8 07:59:25 SERVER kernel: [<ffffffff8143ddc1>] ? __alloc_skb+0x81/0x190
Nov  8 07:59:25 SERVER kernel: [<ffffffff8144cf15>] ? dev_queue_xmit+0x205/0x550
Nov  8 07:59:25 SERVER kernel: [<ffffffff814857f8>] ? ip_finish_output+0x148/0x310
Nov  8 07:59:25 SERVER kernel: [<ffffffff81485a78>] ? ip_output+0xb8/0xc0
Nov  8 07:59:25 SERVER kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Nov  8 07:59:25 SERVER kernel: [<ffffffff81484d75>] ? ip_local_out+0x25/0x30
Nov  8 07:59:25 SERVER kernel: [<ffffffff81485250>] ? ip_queue_xmit+0x190/0x420
Nov  8 07:59:25 SERVER kernel: [<ffffffff8149a04e>] ? tcp_transmit_skb+0x40e/0x7b0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8149c45b>] ? tcp_write_xmit+0x1fb/0xa20
Nov  8 07:59:25 SERVER kernel: [<ffffffff8149ce10>] ? __tcp_push_pending_frames+0x30/0xe0
Nov  8 07:59:25 SERVER kernel: [<ffffffff814948a3>] ? tcp_data_snd_check+0x33/0x100
Nov  8 07:59:25 SERVER kernel: [<ffffffff81498471>] ? tcp_rcv_established+0x371/0x800
Nov  8 07:59:25 SERVER kernel: [<ffffffffa004e609>] ? do_hpsa_intr_msi+0x149/0x290 [hpsa]
Nov  8 07:59:25 SERVER kernel: [<ffffffff814a04e3>] ? tcp_v4_do_rcv+0x2e3/0x430
Nov  8 07:59:25 SERVER kernel: [<ffffffff810e1760>] ? handle_IRQ_event+0x60/0x170
Nov  8 07:59:25 SERVER kernel: [<ffffffff814a1d6e>] ? tcp_v4_rcv+0x4fe/0x8d0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8143d6d7>] ? __kfree_skb+0x47/0xa0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8147f9fd>] ? ip_local_deliver_finish+0xdd/0x2d0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8147fc88>] ? ip_local_deliver+0x98/0xa0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8147f14d>] ? ip_rcv_finish+0x12d/0x440
Nov  8 07:59:25 SERVER kernel: [<ffffffff8147f6d5>] ? ip_rcv+0x275/0x350
Nov  8 07:59:25 SERVER kernel: [<ffffffff814488ab>] ? __netif_receive_skb+0x4ab/0x750
Nov  8 07:59:25 SERVER kernel: [<ffffffff8149f08a>] ? tcp4_gro_receive+0x5a/0xd0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8144ac88>] ? netif_receive_skb+0x58/0x60
Nov  8 07:59:25 SERVER kernel: [<ffffffff8144ad90>] ? napi_skb_finish+0x50/0x70
Nov  8 07:59:25 SERVER kernel: [<ffffffff8144d339>] ? napi_gro_receive+0x39/0x50
Nov  8 07:59:25 SERVER kernel: [<ffffffffa0122c14>] ? tg3_poll_work+0x784/0xe50 [tg3]
Nov  8 07:59:25 SERVER kernel: [<ffffffffa012332c>] ? tg3_poll_msix+0x4c/0x150 [tg3]
Nov  8 07:59:25 SERVER kernel: [<ffffffff8144d453>] ? net_rx_action+0x103/0x2f0
Nov  8 07:59:25 SERVER kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
Nov  8 07:59:25 SERVER kernel: [<ffffffff810e1760>] ? handle_IRQ_event+0x60/0x170
Nov  8 07:59:25 SERVER kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Nov  8 07:59:25 SERVER kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Nov  8 07:59:25 SERVER kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
Nov  8 07:59:25 SERVER kernel: [<ffffffff815176e5>] ? do_IRQ+0x75/0xf0
Nov  8 07:59:25 SERVER kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Nov  8 07:59:25 SERVER kernel: <EOI>  [<ffffffff812d3cae>] ? intel_idle+0xde/0x170
Nov  8 07:59:25 SERVER kernel: [<ffffffff812d3c91>] ? intel_idle+0xc1/0x170
Nov  8 07:59:25 SERVER kernel: [<ffffffff814155f7>] ? cpuidle_idle_call+0xa7/0x140
Nov  8 07:59:25 SERVER kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Nov  8 07:59:25 SERVER kernel: [<ffffffff8150756c>] ? start_secondary+0x2ac/0x2ef
Nov  8 08:04:37 SERVER kernel: swapper: page allocation failure. order:3, mode:0x20
Nov  8 08:04:37 SERVER kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.18.1.el6.x86_64 #1
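
On a busy system the log may hold many allocation failures from unrelated code paths. The following is a minimal sketch for counting only those followed by a tg3_start_xmit frame, assuming the log format shown above. The function name count_tg3_alloc_failures is hypothetical. The match is heuristic: it counts the first tg3_start_xmit line that follows each "page allocation failure" line and does not check where one trace ends.

```shell
#!/bin/sh
# Count page allocation failures that are followed by a tg3_start_xmit
# frame. Reads kernel log text on stdin.
count_tg3_alloc_failures() {
    awk '
        /page allocation failure/      { in_trace = 1; next }
        in_trace && /tg3_start_xmit/   { count++; in_trace = 0 }
        END                            { print count + 0 }'
}

if [ -r /var/log/messages ]; then
    count_tg3_alloc_failures < /var/log/messages
fi
```

The same function can be fed from the kernel ring buffer with dmesg | count_tg3_alloc_failures.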

 
