A deadlock caused by the mmap_sem of a process address space

The problem

A colleague in QA hit a prolonged I/O stall while running a vdbench test. Nobody else could find the cause, so the problem eventually landed on me. The JIRA record showed that no fault injection had been performed and that the backend service processes were running normally.

The live system was already gone, but fortunately a crash core had been saved, so the problem had to be located by analyzing the core with crash.

Analysis with crash

1. Load our kernel modules

crash> mod -S .
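
mod -S . makes crash scan the current directory for the matching module object files and load their debug symbols. Before reading any stacks it is worth confirming that every one of our modules resolved; a bare mod lists each module together with the object file its symbols came from, so an unresolved one stands out:

crash> mod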

2. Dump the stacks of all tasks and look for suspicious ones

crash> foreach bt > bt.all
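
bt.all from a busy node is large; one quick way to triage it (just an illustration, not necessarily the sieve used at the time) is to grep for the functions tasks sleep in when they are stuck on a reader/writer semaphore, keeping the PID header a few lines above each hit:

crash> !grep -B 3 rwsem_down bt.all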

Going through all of the task stacks turned up the following:

PID: 28945 TASK: ffff8b93f673af70 CPU: 6 COMMAND: "dataio_srv/0"
#0 [ffff8b93572e3bb8] __schedule at ffffffffb91128d4
#1 [ffff8b93572e3c48] schedule at ffffffffb9112f49
#2 [ffff8b93572e3c58] schedule_timeout at ffffffffb91107f4
#3 [ffff8b93572e3d08] ofs_tjobq_timedwait_locked at ffffffffc10e01b6
#4 [ffff8b93572e3d78] ofs_tjobq_purge at ffffffffc10e042e
#5 [ffff8b93572e3db8] dataio_srv_tcb at ffffffffc0767195
#6 [ffff8b93572e3e18] ofs_tpool_tcb at ffffffffc10dee8f
#7 [ffff8b93572e3e38] ofs_daemon_proxy at ffffffffc10e0e05
#8 [ffff8b93572e3ec8] kthread at ffffffffb8abae31
PID: 28946 TASK: ffff8b93f673bf40 CPU: 4 COMMAND: "dataio_srv/1"
#0 [ffff8b9350eab728] __schedule at ffffffffb91128d4
#1 [ffff8b9350eab7b0] schedule at ffffffffb9112f49
#2 [ffff8b9350eab7c0] rwsem_down_read_failed at ffffffffb911457d
#3 [ffff8b9350eab890] down_read at ffffffffb9112070
#4 [ffff8b9350eab8a8] ffs_get_user_page at ffffffffc06548fb
#5 [ffff8b9350eab908] ffs_copy_to_user at ffffffffc0654b14
#6 [ffff8b9350eab9a8] ffs_get_size at ffffffffc06ca395
#7 [ffff8b9350eaba68] __direct_read_cb at ffffffffc06cb6c5
#8 [ffff8b9350eabb18] dataio_ack at ffffffffc0768c8f
#9 [ffff8b9350eabc38] segread_end at ffffffffc0750abe
#10 [ffff8b9350eabd18] dinode_odc_nr at ffffffffc0766f9e
#11 [ffff8b9350eabdb8] dataio_srv_tcb at ffffffffc07671cf
#12 [ffff8b9350eabe18] ofs_tpool_tcb at ffffffffc10dee8f
#13 [ffff8b9350eabe38] ofs_daemon_proxy at ffffffffc10e0e05
#14 [ffff8b9350eabec8] kthread at ffffffffb8abae31
PID: 28947 TASK: ffff8b93f673cf10 CPU: 12 COMMAND: "dataio_srv/2"
#0 [ffff8b93c9767728] __schedule at ffffffffb91128d4
#1 [ffff8b93c97677b0] schedule at ffffffffb9112f49
#2 [ffff8b93c97677c0] rwsem_down_read_failed at ffffffffb911457d
#3 [ffff8b93c9767890] down_read at ffffffffb9112070
#4 [ffff8b93c97678a8] ffs_get_user_page at ffffffffc06548fb
#5 [ffff8b93c9767908] ffs_copy_to_user at ffffffffc0654b14
#6 [ffff8b93c97679a8] ffs_get_size at ffffffffc06ca395
#7 [ffff8b93c9767a68] __direct_read_cb at ffffffffc06cb6c5
#8 [ffff8b93c9767b18] dataio_ack at ffffffffc0768c8f
#9 [ffff8b93c9767c38] segread_end at ffffffffc0750abe
#10 [ffff8b93c9767d18] dinode_odc_nr at ffffffffc0766f9e
#11 [ffff8b93c9767db8] dataio_srv_tcb at ffffffffc07671cf
#12 [ffff8b93c9767e18] ofs_tpool_tcb at ffffffffc10dee8f
#13 [ffff8b93c9767e38] ofs_daemon_proxy at ffffffffc10e0e05
#14 [ffff8b93c9767ec8] kthread at ffffffffb8abae31
PID: 28948 TASK: ffff8b93f673dee0 CPU: 15 COMMAND: "dataio_srv/3"
#0 [ffff8b9281f8f728] __schedule at ffffffffb91128d4
#1 [ffff8b9281f8f7b0] schedule at ffffffffb9112f49
#2 [ffff8b9281f8f7c0] rwsem_down_read_failed at ffffffffb911457d
#3 [ffff8b9281f8f890] down_read at ffffffffb9112070
#4 [ffff8b9281f8f8a8] ffs_get_user_page at ffffffffc06548fb
#5 [ffff8b9281f8f908] ffs_copy_to_user at ffffffffc0654b14
#6 [ffff8b9281f8f9a8] ffs_get_size at ffffffffc06ca395
#7 [ffff8b9281f8fa68] __direct_read_cb at ffffffffc06cb6c5
#8 [ffff8b9281f8fb18] dataio_ack at ffffffffc0768c8f
#9 [ffff8b9281f8fc38] segread_end at ffffffffc0750abe
#10 [ffff8b9281f8fd18] dinode_odc_nr at ffffffffc0766f9e
#11 [ffff8b9281f8fdb8] dataio_srv_tcb at ffffffffc07671cf
#12 [ffff8b9281f8fe18] ofs_tpool_tcb at ffffffffc10dee8f
#13 [ffff8b9281f8fe38] ofs_daemon_proxy at ffffffffc10e0e05
#14 [ffff8b9281f8fec8] kthread at ffffffffb8abae31

dataio_srv is our data-service thread pool, and three of its worker threads are stuck on the same rw_semaphore: ffs_get_user_page takes the read side of the requesting process's mmap_sem, so the analysis points to some task having requested the write side of that semaphore while an earlier read hold had not been released.
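
ffs_get_user_page is our own helper, so its body is not shown here; the sketch below (assumed, not the real implementation, using the 3.10-era get_user_pages() signature) only illustrates the locking pattern that matters for this hang: a down_read() on the target process's mmap_sem around the page pinning.

#include <linux/mm.h>
#include <linux/sched.h>

/* Sketch only: pin one user page of the requesting process.
 * The important part is the down_read() on mmap_sem, which is
 * exactly where the dataio_srv threads above are sleeping. */
static int ffs_get_user_page_sketch(struct mm_struct *mm, unsigned long uaddr,
                                    struct page **page)
{
        long ret;

        down_read(&mm->mmap_sem);                   /* read side of the victim's mmap_sem */
        ret = get_user_pages(current, mm, uaddr, 1, /* one page                           */
                             1, 0,                  /* write access, no force             */
                             page, NULL);
        up_read(&mm->mmap_sem);

        return ret == 1 ? 0 : -EFAULT;
}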

3. Locate the corresponding mm_struct and, from it, the mmap_sem, whose address is 0xffff8b92ab737e20
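
The commands that produced this address were not kept in the session log; on this kernel one way to get there is to follow task_struct->mm with a gdb-style expression and take the address of the embedded semaphore (TASK_ADDR is a placeholder for any user task attached to this address space, e.g. one of the java threads seen later):

crash> px &((struct task_struct *)TASK_ADDR)->mm->mmap_sem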

The wait queue of a reader/writer semaphore is made up of struct rwsem_waiter entries, so walking that list yields every task waiting on this semaphore.

struct rwsem_waiter {
        struct list_head list;
        struct task_struct *task;
        enum rwsem_waiter_type type;
};
crash> list rwsem_waiter.list -s rwsem_waiter.task,type -h 0xffff8b92ab737e20
ffff8b92ab737e20
task = 0xffff8b93cdcf8000
type = RWSEM_WAITING_FOR_WRITE
ffff8b9350eab7e0
task = 0xffff8b93f673bf40
type = RWSEM_WAITING_FOR_READ
ffff8b93c97677e0
task = 0xffff8b93f673cf10
type = RWSEM_WAITING_FOR_READ
ffff8b9281f8f7e0
task = 0xffff8b93f673dee0
type = RWSEM_WAITING_FOR_READ
ffff8b93ee2f3df0
task = 0xffff8b92b4332f70
type = RWSEM_WAITING_FOR_READ
ffff8b93f41cbcb0
task = 0xffff8b93ff72eeb0
type = RWSEM_WAITING_FOR_READ
ffff8b93758ebd50
task = 0xffff8b92826d0fd0
type = RWSEM_WAITING_FOR_READ
ffff8b918d43bd50
task = 0xffff8b93d409dee0
type = RWSEM_WAITING_FOR_READ
ffff8b9353bbfd50
task = 0xffff8b91d44cdee0
type = RWSEM_WAITING_FOR_READ
ffff8b93e5337d50
task = 0xffff8b93be258000
type = RWSEM_WAITING_FOR_READ
ffff8b9282d17d80
task = 0xffff8b9307be1fa0
type = RWSEM_WAITING_FOR_READ
ffff8b9378ebfd80
task = 0xffff8b93ea6adee0
type = RWSEM_WAITING_FOR_READ
ffff8b933123bd50
task = 0xffff8b92f9a31fa0
type = RWSEM_WAITING_FOR_READ
ffff8b93e93dfd80
task = 0xffff8b9307be0000
type = RWSEM_WAITING_FOR_READ
ffff8b919f933d80
task = 0xffff8b91aa413f40
type = RWSEM_WAITING_FOR_READ
ffff8b93a9fabd80
task = 0xffff8b938032dee0
type = RWSEM_WAITING_FOR_READ
ffff8b91a347fd80
task = 0xffff8b91e4f51fa0
type = RWSEM_WAITING_FOR_READ
ffff8b92de83bd80
task = 0xffff8b936d065ee0
type = RWSEM_WAITING_FOR_READ
ffff8b92d73b3d80
task = 0xffff8b935b753f40
type = RWSEM_WAITING_FOR_READ

4. The stack of the task requesting the write semaphore

crash> p (struct task_struct *)0xffff8b93cdcf8000
$1 = (struct task_struct *) 0xffff8b93cdcf8000
crash> p $1.pid
$2 = 12862
crash> bt 12862
PID: 12862 TASK: ffff8b93cdcf8000 CPU: 0 COMMAND: "java"
#0 [ffff8b92ab737d58] __schedule at ffffffffb91128d4
#1 [ffff8b92ab737de0] schedule at ffffffffb9112f49
#2 [ffff8b92ab737df0] rwsem_down_write_failed at ffffffffb9114835
#3 [ffff8b92ab737e80] call_rwsem_down_write_failed at ffffffffb8d5a7b7
#4 [ffff8b92ab737ec8] down_write at ffffffffb91120bd
#5 [ffff8b92ab737ee0] do_mprotect_pkey at ffffffffb8bcf805
#6 [ffff8b92ab737f40] sys_mprotect at ffffffffb8bcfac3
#7 [ffff8b92ab737f50] system_call_fastpath at ffffffffb911f7d5
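
The writer is nothing exotic: it is a plain mprotect() call from the JVM. Changing page protections means rewriting VMAs, which the kernel only does under the write side of mmap_sem; the shape of that path is roughly as follows (heavily abridged sketch, not verbatim kernel source; same headers as the sketch above):

/* abridged sketch of the sys_mprotect/do_mprotect_pkey locking */
static long mprotect_locking_sketch(unsigned long start, size_t len, unsigned long prot)
{
        down_write(&current->mm->mmap_sem);   /* exclusive: the VMA list is about to change */
        /* ... split/merge the VMAs covering [start, start+len) and apply prot ... */
        up_write(&current->mm->mmap_sem);
        return 0;
}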

5. Find the task holding the read semaphore
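
Tasks that currently hold the read side never show up on the wait list, so they have to be dug out of the stack dump: a thread of the same java process that went to sleep somewhere inside a page fault is the prime suspect, because the fault path runs with mmap_sem held for read. A sieve like the one below over bt.all turns up thread 22134 (command illustrative, not necessarily the one used at the time):

crash> !grep -B 10 ffs_filemap_fault bt.all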

crash> bt 22134
PID: 22134 TASK: ffff8b91992e3f40 CPU: 13 COMMAND: "java"
#0 [ffff8b91efd5b768] __schedule at ffffffffb91128d4
#1 [ffff8b91efd5b7f8] schedule at ffffffffb9112f49
#2 [ffff8b91efd5b808] lock_blkgrp_sync at ffffffffc06bf512
#3 [ffff8b91efd5b948] wait_on_blkgrp_locked at ffffffffc06c107c
#4 [ffff8b91efd5b998] filemap_read_blkgrp at ffffffffc064d9ae
#5 [ffff8b91efd5ba28] __ffs_do_filemap_fault at ffffffffc064f2d4
#6 [ffff8b91efd5bb88] ffs_do_filemap_fault at ffffffffc064f705
#7 [ffff8b91efd5bbe8] ffs_filemap_fault at ffffffffc064fce3
#8 [ffff8b91efd5bc98] __do_fault at ffffffffb8bc05fa
#9 [ffff8b91efd5bd18] do_read_fault at ffffffffb8bc0b5c
#10 [ffff8b91efd5bd78] handle_pte_fault at ffffffffb8bc52dc
#11 [ffff8b91efd5be10] handle_mm_fault at ffffffffb8bc747d
#12 [ffff8b91efd5beb8] __do_page_fault at ffffffffb911a597
#13 [ffff8b91efd5bf20] do_page_fault at ffffffffb911a925
#14 [ffff8b91efd5bf50] page_fault at ffffffffb9116768

__do_page_fault takes the read side of mmap_sem before calling into the fault handler. While holding it, thread 22134 sends a read command to dataio to fetch the data from the backend, but the dataio threads are themselves stuck on this very semaphore (because 12862 has requested the write side, and any later read requests are queued behind the pending writer). The result is a deadlock.
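
Putting the three stacks together, the wait cycle is:

java 22134 (page fault)   holds  mmap_sem for read    waits for dataio_srv to finish the backend read
dataio_srv/1..3           wants  mmap_sem for read    queued behind the pending writer 12862
java 12862 (mprotect)     wants  mmap_sem for write   queued behind 22134's read hold

    22134 -> dataio_srv -> 12862 -> 22134, so no one can make progress.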