
一次多进程信号量同步失效的排查实录
本文主要是记录笔者在进行无名信号量的代码调试时遇到的两个进程没有通过同一个共享的信号量同步的日志记录
问题现象
笔者写了一个示例代码如下:
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h> /* For mode constants */
#include <fcntl.h> /* For O_* constants */
#include <unistd.h>
#define SHM_NAME "/wjbtest"
int main(int argc,char **argv)
{
int fd = 0;
int ret = 0;
printf("PID : %d\r\n",getpid());
fd = shm_open(SHM_NAME,O_CREAT | O_RDWR,0640);
if(ftruncate(fd,sizeof(sem_t)) != 0)
{
perror("ftruncate failed:");
//解除共享内存空间
shm_unlink(SHM_NAME);
return -1;
}
//获取共享内存文件相关属性信息,这里获取的是文件大小,查看是否是设置的大小
struct stat filestat;
fstat(fd, &filestat);
printf("shm st_size :%ld\n",filestat.st_size);
sem_t *sem_ptr = (sem_t *)mmap(NULL,sizeof(sem_t),PROT_READ | PROT_WRITE,MAP_SHARED,fd,0);
if(sem_ptr == NULL)
{
perror("mmap failed :");
return -1;
}
printf("sem_ptr : %x\r\n",sem_ptr);
close(fd);
//进行信号量的初始化
ret = sem_init(sem_ptr,1,1);
if(ret != 0)
{
perror("sem_init failed :");
return -1;
}
printf("sem init succ\r\n");
sem_wait(sem_ptr);
printf("sem wait succ\r\n");
sleep(180);
sem_post(sem_ptr);
printf("sem post succ\r\n");
return 0;
}
这个代码使用一块共享内存共享无名信号量,第一次执行程序对信号量进行初始化然后首先wait将信号量-1,随后进入sleep,sleep结束之后进行post将信号量+1。在第一次程序运行过程中第二次执行程序,理论上第二次程序应该一直在wait阻塞知道第一次程序完成post
但是实际现象为第二个程序直接wait并且进行后续操作,没有被阻塞
jhliu@test:~/OS_code/semCode/unnamedSem$ ./gdbmySem &
[1] 3975137
jhliu@test:~/OS_code/semCode/unnamedSem$ PID : 3975137
shm st_size :32
sem_ptr : 1c049000
sem init succ
sem wait succ
jhliu@test:~/OS_code/semCode/unnamedSem$ ./gdbmySem &
[2] 3975162
jhliu@test:~/OS_code/semCode/unnamedSem$ PID : 3975162
shm st_size :32
sem_ptr : ad6ca000
sem init succ
sem wait succ
jhliu@test:~/OS_code/semCode/unnamedSem$
问题的原因其实很简单就是在第二次程序运行时也对同一块内存空间的信号量进行了初始化,但是这里记录笔者还不知道这个原因进行的调试步骤
调试过程
strace查看系统调用
首先我是用strace命令查看系统调用
# 终端 1:启动第一个进程,跟踪并分文件输出日志
strace -f -o ./proc1.log ./gdbmySem &
# 终端 2:启动第二个进程(等第一个进入 sleep 后再执行)
strace -f -o ./proc2.log ./gdbmySem &
先查看他们是否使用的同一块共享内存
jhliu@test:~/OS_code/semCode/unnamedSem$ cat ./proc1.log | grep openat
3865113 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3865113 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
3865113 openat(AT_FDCWD, "/dev/shm/wjbtest", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0640) = 3
jhliu@test:~/OS_code/semCode/unnamedSem$ cat ./proc2.log | grep openat
3865214 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3865214 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
3865214 openat(AT_FDCWD, "/dev/shm/wjbtest", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0640) = 3
可以看到这里两个程序·都是打开的同一个共享内存/dev/shm/wjbtest
接下来我想要查看sem_init和sem_wait的系统调用是什么,查看sem_t的值是否是预期的值
POSIX 信号量在 Linux 上通常基于
futex实现。sem_init内部会初始化一个futex字,sem_wait/post会触发futex系统调用。
但是实际使用发现strace 日志中没有 futex 记录,询问AI可能原因是:
sem_wait的“快路径”优化(最核心原因),sem_wait并不是每次调用都进入内核sem_init()在现代 Linux/glibc 中根本不产生任何系统调用
所以这里决定使用gdb进行验证调试
gdb调试程序
将程序编译为调试版本:
gcc -g -O0 -o gdbmySem mySem.cpp
然后先运行第一次程序,等待它进入sleep后查看堆栈信息,使用gdb -p命令
root@test:/home/jhliu/OS_code/semCode/unnamedSem# ./gdbmySem &
[1] 3914304
root@test:/home/jhliu/OS_code/semCode/unnamedSem# PID : 3914304
shm st_size :32
sem_ptr : 60c3000
sem init succ
sem wait succ
root@test:/home/jhliu/OS_code/semCode/unnamedSem# gdb -p 3914304
Attaching to process 3914304
Reading symbols from /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem...
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
Reading symbols from /usr/lib/debug/.build-id/91/f01b4ad171c80b6303d08d1f08cba8b990413d.debug...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/91/0350ade28cd94b8877b329819847fbc04f9bd9.debug...
(gdb) bt
#0 __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
req=req@entry=0x7ffd68f38e70, rem=rem@entry=0x7ffd68f38e70)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71
#1 0x0000708d05f01597 in __GI___nanosleep (req=req@entry=0x7ffd68f38e70,
rem=rem@entry=0x7ffd68f38e70) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2 0x0000708d05f18581 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3 0x00005953a3eab4e9 in main (argc=1, argv=0x7ffd68f39098) at mySem.cpp:54
(gdb) frame 3
先使用bt查看堆栈信息,现在在sleep的循环语句中,先切换到main函数内,使用frame 3
在main函数栈中打印sem_ptr信息
#3 0x00005953a3eab4e9 in main (argc=1, argv=0x7ffd68f39098) at mySem.cpp:54
54 sleep(180);
(gdb) print sem_ptr
$1 = (sem_t *) 0x708d060c3000
(gdb) print *(unsigned int *)sem_ptr
$2 = 0
这里可以看到打印出来的sem_ptr值为0,说明wait操作确实将sem_ptr减一了
并且可以查看程序底层共享内存映射
(gdb) info proc mappings
process 3914304
Mapped address spaces:
Start Addr End Addr Size Offset Perms objfile
0x5953a3eaa000 0x5953a3eab000 0x1000 0x0 r--p /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
0x5953a3eab000 0x5953a3eac000 0x1000 0x1000 r-xp /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
0x5953a3eac000 0x5953a3ead000 0x1000 0x2000 r--p /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
0x5953a3ead000 0x5953a3eae000 0x1000 0x2000 r--p /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
0x5953a3eae000 0x5953a3eaf000 0x1000 0x3000 rw-p /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
0x5953a8f0f000 0x5953a8f30000 0x21000 0x0 rw-p [heap]
0x708d05e00000 0x708d05e28000 0x28000 0x0 r--p /usr/lib/x86_64-linux-gnu/libc.so.6
0x708d05e28000 0x708d05fbe000 0x196000 0x28000 r-xp /usr/lib/x86_64-linux-gnu/libc.so.6
0x708d05fbe000 0x708d0600d000 0x4f000 0x1be000 r--p /usr/lib/x86_64-linux-gnu/libc.so.6
0x708d0600d000 0x708d06011000 0x4000 0x20c000 r--p /usr/lib/x86_64-linux-gnu/libc.so.6
0x708d06011000 0x708d06013000 0x2000 0x210000 rw-p /usr/lib/x86_64-linux-gnu/libc.so.6
0x708d06013000 0x708d06020000 0xd000 0x0 rw-p
0x708d060b2000 0x708d060b5000 0x3000 0x0 rw-p
0x708d060c3000 0x708d060c4000 0x1000 0x0 rw-s /dev/shm/wjbtest
0x708d060c4000 0x708d060c6000 0x2000 0x0 rw-p
0x708d060c6000 0x708d060ca000 0x4000 0x0 r--p [vvar]
0x708d060ca000 0x708d060cc000 0x2000 0x0 r-xp [vdso]
0x708d060cc000 0x708d060cd000 0x1000 0x0 r--p /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x708d060cd000 0x708d060fa000 0x2d000 0x1000 r-xp /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x708d060fa000 0x708d06104000 0xa000 0x2e000 r--p /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x708d06104000 0x708d06106000 0x2000 0x38000 r--p /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x708d06106000 0x708d06108000 0x2000 0x3a000 rw-p /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x7ffd68f1b000 0x7ffd68f3c000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
紧接着直接使用gdb运行第二次程序,在程序中增加断点,主要是sem_init和sem_wait两个地方,用于查看初始化之前和之后两次sem_ptr的值
root@test:/home/jhliu/OS_code/semCode/unnamedSem# gdb ./gdbmySem
GNU gdb (Ubuntu 15.1-1ubuntu2) 15.1
Reading symbols from ./gdbmySem...
(gdb) set breakpoint pending on
(gdb) break sem_init
Breakpoint 1 at 0x11c0
(gdb) break sem_wait
Breakpoint 2 at 0x1130
然后运行程序,程序会在断点处停下,这时候我们首先查看init之前的值:
(gdb) run
Starting program: /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
PID : 3930267
shm st_size :32
sem_ptr : f7fba000
Breakpoint 1, __new_sem_init (sem=0x7ffff7fba000, pshared=1, value=1)
at ./nptl/sem_init.c:28
warning: 28 ./nptl/sem_init.c: No such file or directory
(gdb) bt
#0 __new_sem_init (sem=0x7ffff7fba000, pshared=1, value=1) at ./nptl/sem_init.c:28
#1 0x000055555555548d in main (argc=1, argv=0x7fffffffdf88) at mySem.cpp:43
(gdb) frame 1
#1 0x000055555555548d in main (argc=1, argv=0x7fffffffdf88) at mySem.cpp:43
43 ret = sem_init(sem_ptr,1,1);
(gdb) print sem_ptr
$1 = (sem_t *) 0x7ffff7fba000
(gdb) print *(unsigned int *)0x7ffff7fba000
$2 = 0
在这里init之前看到sem_ptr的值是0,符合前面第一次程序wait之后将值减一
然后我们继续运行程序
(gdb) frame 0
#0 __new_sem_init (sem=0x7ffff7fba000, pshared=1, value=1) at ./nptl/sem_init.c:28
warning: 28 ./nptl/sem_init.c: No such file or directory
(gdb) continue
Continuing.
sem init succ
Breakpoint 2, __new_sem_wait (sem=0x7ffff7fba000) at ./nptl/sem_wait.c:25
warning: 25 ./nptl/sem_wait.c: No such file or directory
(gdb) bt
#0 __new_sem_wait (sem=0x7ffff7fba000) at ./nptl/sem_wait.c:25
#1 0x00005555555554d0 in main (argc=1, argv=0x7fffffffdf88) at mySem.cpp:51
(gdb) print *(unsigned int *)0x7ffff7fba000
$3 = 1
(gdb) continue
Continuing.
sem wait succ
sem post succ
sem post succ
[Inferior 1 (process 3930267) exited normally]
程序init之后来到第二个断点wait之前,在这里继续打印出sem_ptr的值,发现这里变成了1,而后继续运行程序就能直接wait进行后续操作
这里变量的虚拟地址在一个程序中是固定的,所以直接打印
sem_ptr的虚拟地址
到这里笔者查阅了sem_init函数的定义说明:Initializing a semaphore that has already been initialized results in undefined behavior.,反复调用确实重置sem_t的值
