[BUG] NFS client kernel panic in rpciod
SYPER
2014. 4. 24. 15:24
Issue:
- The NFS client kernel crashes because an async task is already queued, hitting BUG_ON(RPC_IS_QUEUED(task)); in __rpc_execute.
- A second panic is similar in that the rpciod thread panics, but at a different place: it hits a kernel BUG at kernel/workqueue.c, specifically BUG_ON(get_wq_data(work) != cwq);. Prior to the oops, there are also warnings about list corruption, triggered from a __list_add called from xprt_reserve_xprt. Based on the location in the code, the list corruption is being flagged on the rpc_xprt's 'sending' or 'resend' queue.
- See this kbase article for more details.
- NFS 4 Client and Server are RHEL 6.2 (kernel 2.6.32-220.el6.x86_64)
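The first BUG_ON listed above is a check on a single bit of the task's run state. A minimal user-space sketch of that check follows, assuming the queued flag is bit 1 of tk_runstate; every sketch_* name is hypothetical, and the -1 return stands in for the point where the kernel would panic.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit index for the "queued" flag; stands in for RPC_TASK_QUEUED. */
#define SKETCH_TASK_QUEUED 1

/* Hypothetical, stripped-down task: only the run-state word is modelled. */
struct sketch_rpc_task {
    unsigned long tk_runstate;
};

/* Models RPC_IS_QUEUED(task): is the queued bit set? */
bool sketch_is_queued(const struct sketch_rpc_task *t)
{
    return (t->tk_runstate >> SKETCH_TASK_QUEUED) & 1UL;
}

/* Models the entry check in __rpc_execute: returns 0 if the task may run,
 * -1 where the kernel's BUG_ON(RPC_IS_QUEUED(task)) would fire. */
int sketch_execute_check(const struct sketch_rpc_task *t)
{
    return sketch_is_queued(t) ? -1 : 0;
}

/* A clean task passes the check; a stray queued bit makes it fail. */
int sketch_demo(void)
{
    struct sketch_rpc_task task = { .tk_runstate = 0 };

    assert(sketch_execute_check(&task) == 0);

    /* Simulate the bit being set incorrectly while the task is about to run. */
    task.tk_runstate |= 1UL << SKETCH_TASK_QUEUED;

    return sketch_execute_check(&task);
}
```

Calling sketch_demo() returns -1 in this sketch, which corresponds to rpciod executing a task that is still marked as queued.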
A fix is still being developed. Test kernels are available. Please contact your support representative for more information.
Root Cause:
- Because of a race condition or use-after-free, the 'RPC_TASK_QUEUED' bit in rpc_task.tk_runstate can be set incorrectly on an rpc_task.
- Ultimately one of the following kernel crashes will result:
- The rpciod thread crashes with kernel BUG at net/sunrpc/sched.c seen in the log, with RIP inside __rpc_execute. The specific BUG_ON is BUG_ON(RPC_IS_QUEUED(task)).
- The rpciod thread crashes with kernel BUG at kernel/workqueue.c seen in the log, with RIP inside worker_thread. The specific BUG_ON is BUG_ON(get_wq_data(work) != cwq).
- A kernel crash results from corruption of the rpc_task.u union. In this case, simultaneous use of both the 'tk_work' and 'tk_wait' members of the union leads to one of two outcomes. Either an rpc_wait_queue is corrupted (the 'tk_work' member is initialized, but the 'tk_wait' member is accessed; often seen as a corrupt pending, sending, or resend queue on the rpc_xprt), or a workqueue_struct is corrupted (the 'tk_wait' member is initialized, but the 'tk_work' member is accessed; often seen as a corrupt rpciod workqueue_struct).
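The union corruption described above can be reproduced in miniature in user space. The sketch below uses simplified stand-in structures, not the kernel definitions: all sketch_* names, the field layouts, and the 0x1234 cookie are assumptions made for illustration. It shows that once a task is linked through 'tk_wait', the value stored earlier through 'tk_work' is gone, which is analogous to the get_wq_data(work) != cwq mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for illustration; NOT the kernel definitions. */
struct sketch_list_head {
    struct sketch_list_head *next, *prev;
};

struct sketch_work {                   /* loosely models a work item */
    unsigned long data;                /* assumed to hold the workqueue reference */
    struct sketch_list_head entry;
    void (*func)(struct sketch_work *);
};

struct sketch_wait {                   /* loosely models wait-queue linkage */
    struct sketch_list_head list;      /* links the task into a wait queue */
    unsigned long expires;
};

struct sketch_task {
    union {                            /* both members share the same bytes */
        struct sketch_work tk_work;
        struct sketch_wait tk_wait;
    } u;
};

/* Returns 1 if linking the task through tk_wait destroyed the value
 * previously stored through tk_work, 0 otherwise. */
int sketch_wait_clobbers_work(void)
{
    struct sketch_task task;
    const unsigned long cookie = 0x1234UL;  /* hypothetical workqueue reference */

    /* In this sketch's layout the first fields of both members overlap. */
    assert(offsetof(struct sketch_work, data) ==
           offsetof(struct sketch_wait, list));

    memset(&task, 0, sizeof(task));
    task.u.tk_work.data = cookie;           /* task prepared as a work item */

    /* The same task is then treated as sleeping on a wait queue:
     * initialise its list linkage to point at itself. */
    task.u.tk_wait.list.next = &task.u.tk_wait.list;
    task.u.tk_wait.list.prev = &task.u.tk_wait.list;

    /* Reading through tk_work now yields wait-queue pointer bytes. */
    return task.u.tk_work.data != cookie;
}
```

The reverse direction follows the same pattern: writing the work item fields after the task has been linked into a wait queue overwrites the list pointers, which would match the __list_add corruption warnings.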