NFS client kernel panic in rpciod

Issue:

  • The NFS client kernel crashes because an async RPC task is already queued, hitting BUG_ON(RPC_IS_QUEUED(task)); in __rpc_execute.
  • A second panic is similar in that an rpciod thread panics, but at a different place: kernel BUG at kernel/workqueue.c, which is BUG_ON(get_wq_data(work) != cwq);. Prior to the oops, the log shows warnings about list corruption, triggered from a __list_add called from xprt_reserve_xprt. Based on the location in the code, the list corruption is being flagged on the rpc_xprt's 'sending' or 'resend' queue.
  • See this kbase article for more details.
Environment:
  • NFSv4 client and server, both running RHEL 6.2 (kernel 2.6.32-220.el6.x86_64)
Resolution:

A fix is still being developed. Test kernels are available. Please contact your support representative for more information.

Root Cause:

  • Because of a race condition or a use-after-free, the 'RPC_TASK_QUEUED' bit in rpc_task.tk_runstate can be set incorrectly on an rpc_task.
  • Ultimately one of the following kernel crashes will result:
    1. An rpciod thread crashes with kernel BUG at net/sunrpc/sched.c seen in the log, with RIP inside __rpc_execute. The specific BUG_ON statement is BUG_ON(RPC_IS_QUEUED(task));
    2. An rpciod thread crashes with kernel BUG at kernel/workqueue.c seen in the log, with RIP inside worker_thread. The specific BUG_ON statement is BUG_ON(get_wq_data(work) != cwq);
    3. A kernel crash results from corruption of the rpc_task.u union. In this case, the 'tk_work' and 'tk_wait' members of the union are used simultaneously, which leads to one of two outcomes. Either the rpc_wait_queue is corrupted (the 'tk_work' member is initialized, but the 'tk_wait' member is accessed; often seen as a corrupt pending, sending, or resend queue on the rpc_xprt), or the workqueue_struct is corrupted (the 'tk_wait' member is initialized, but the 'tk_work' member is accessed; often seen as a corrupt rpciod workqueue_struct).
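
The union corruption described in item 3 can be illustrated with a small user-space sketch. This is not kernel code: work_model, wait_model, task_model and the helper functions are hypothetical, heavily simplified stand-ins for the kernel's work_struct, the rpc_task wait-queue member, and rpc_task itself, and the field layouts are assumptions chosen only to show the aliasing. The point is that 'tk_work' and 'tk_wait' share the same memory, so initializing one silently destroys the list linkage held in the other.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal doubly linked list head, modelled on the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* A freshly initialized list head points to itself in both directions. */
static int list_head_is_self_linked(const struct list_head *h)
{
    return h->next == h && h->prev == h;
}

/* Hypothetical stand-in for the workqueue member (layout is an assumption). */
struct work_model {
    uintptr_t data;
    struct list_head entry;
};

/* Hypothetical stand-in for the wait-queue member (layout is an assumption). */
struct wait_model {
    struct list_head list;
};

/* Simplified task: both members overlay the same storage, as in rpc_task.u. */
struct task_model {
    union {
        struct work_model tk_work;
        struct wait_model tk_wait;
    } u;
};

static void model_init_wait(struct task_model *t)
{
    init_list_head(&t->u.tk_wait.list);
}

static void model_init_work(struct task_model *t)
{
    t->u.tk_work.data = 0;
    init_list_head(&t->u.tk_work.entry);
}

/*
 * Initialize the task as if it were on a wait queue, then initialize it
 * as a work item without removing it first. Returns whether the wait
 * list linkage is still intact afterwards.
 */
int demo_union_aliasing(void)
{
    struct task_model t;

    model_init_wait(&t);
    assert(list_head_is_self_linked(&t.u.tk_wait.list));

    /* Overwrites the bytes that tk_wait.list occupies. */
    model_init_work(&t);

    return list_head_is_self_linked(&t.u.tk_wait.list);
}
```

In this model, demo_union_aliasing() reports that the wait list is no longer intact once the work member has been initialized over it. In the kernel, code that later walks or modifies such a list would see invalid next/prev pointers, which is consistent with the list corruption warnings from __list_add mentioned above.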

