[BUG] NFS client kernel panic in rpciod

OS/RedHat Bug Report

SYPER 2014. 4. 24. 15:24

Issue:

The NFS client kernel crashes because an async RPC task is already queued, hitting BUG_ON(RPC_IS_QUEUED(task)); in __rpc_execute.
A second panic is similar in that the rpciod thread panics, but at a different place: a kernel BUG at kernel/workqueue.c, at the statement BUG_ON(get_wq_data(work) != cwq);. Prior to the oops, there are also warnings about list corruption, triggered from a __list_add called from xprt_reserve_xprt. Based on the location in the code, the list corruption is being flagged on the rpc_xprt's 'sending' or 'resend' queue.
See this kbase article for more details.

Environment:

Resolution:

A fix is still being developed. Test kernels are available. Please contact your support representative for more information.

Root Cause:

Because of a race condition or a use-after-free, the 'RPC_TASK_QUEUED' bit in rpc_task.tk_runstate can be set incorrectly on an rpc_task.
Ultimately one of the following kernel crashes will result:
1. The rpciod thread crashes with a kernel BUG at net/sunrpc/sched.c, seen in the log with RIP inside __rpc_execute. The specific statement is BUG_ON(RPC_IS_QUEUED(task)).
2. The rpciod thread crashes with a kernel BUG at kernel/workqueue.c, seen in the log with RIP inside worker_thread. The specific statement is BUG_ON(get_wq_data(work) != cwq).
3. A kernel crash results from corruption of the rpc_task.u union. In this case both the 'tk_work' and 'tk_wait' members of the union are in use at the same time, which leads to one of two outcomes: a corrupt rpc_wait_queue (the 'tk_work' member is initialized, but the 'tk_wait' member is accessed; often seen as a corrupt pending, sending, or resend queue on the rpc_xprt), or a corrupt workqueue_struct (the 'tk_wait' member is initialized, but the 'tk_work' member is accessed; often seen as a corrupt rpciod workqueue_struct).
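
The union corruption in case 3 can be illustrated with a small user-space sketch. This is not kernel code: struct task, struct work_item, struct wait_item, TASK_QUEUED_BIT and every helper below are simplified, hypothetical stand-ins for rpc_task, work_struct, the wait-queue member, and the RPC_TASK_QUEUED bit, and the field layout is only assumed to resemble the real one. It shows that once the 'tk_work' view of the union is initialized while the task is still marked queued, the list pointers seen through the 'tk_wait' view no longer describe a valid list.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins for illustration only; NOT the kernel definitions. */
struct list_node { struct list_node *next, *prev; };

struct work_item {                 /* stand-in for struct work_struct */
    unsigned long data;            /* flags word, overlays tk_wait.list.next */
    struct list_node entry;
    void (*func)(struct work_item *);
};

struct wait_item {                 /* stand-in for the wait-queue member */
    struct list_node list;
    struct list_node links;
};

struct task {                      /* stand-in for struct rpc_task */
    unsigned long runstate;        /* stand-in for tk_runstate */
    union {
        struct work_item tk_work;
        struct wait_item tk_wait;
    } u;
};

#define TASK_QUEUED_BIT 1UL        /* hypothetical bit position */

static void noop(struct work_item *w) { (void)w; }

static void init_list(struct list_node *n) { n->next = n; n->prev = n; }

static int task_is_queued(const struct task *t)
{
    return (int)((t->runstate >> TASK_QUEUED_BIT) & 1UL);
}

/* A freshly initialized list node points at itself in both directions.
 * Only pointer values are compared; nothing is dereferenced. */
static int wait_view_is_sane(const struct task *t)
{
    const struct list_node *n = &t->u.tk_wait.list;
    return n->next == n && n->prev == n;
}

/* Returns 1 if initializing the work view corrupted the wait view
 * while the task was still flagged as queued. */
static int demo_union_aliasing(void)
{
    struct task t;
    memset(&t, 0, sizeof t);

    /* Path A: the task is set up to sleep on a wait queue. */
    init_list(&t.u.tk_wait.list);
    init_list(&t.u.tk_wait.links);
    t.runstate |= 1UL << TASK_QUEUED_BIT;
    assert(wait_view_is_sane(&t));

    /* Path B (the bug scenario): the same storage is set up as a work
     * item although the queued bit is still set. */
    t.u.tk_work.data = 0x5UL;      /* arbitrary non-pointer flag value */
    init_list(&t.u.tk_work.entry);
    t.u.tk_work.func = noop;

    /* The wait view now reads the work item's fields as list pointers. */
    return task_is_queued(&t) && !wait_view_is_sane(&t);
}
```

Calling demo_union_aliasing() from a main() returns 1 under these assumptions: the task still reports itself as queued, but its wait-queue list node no longer points at itself. A list walker that trusted those pointers would follow garbage, which is consistent with the list-corruption warnings described above.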