一次多进程信号量同步失效的排查实录

一次多进程信号量同步失效的排查实录

本文主要是记录笔者在进行无名信号量的代码调试时遇到的两个进程没有通过同一个共享的信号量同步的日志记录

问题现象

笔者写了一个示例代码如下:

#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>        /* For mode constants */
#include <fcntl.h>           /* For O_* constants */
#include <unistd.h>

#define SHM_NAME "/wjbtest"

int main(int argc,char **argv)
{
    int fd = 0;
    int ret = 0;
    printf("PID : %d\r\n",getpid());

    fd = shm_open(SHM_NAME,O_CREAT | O_RDWR,0640);

    if(ftruncate(fd,sizeof(sem_t)) != 0)
    {
        perror("ftruncate failed:");
        //解除共享内存空间
        shm_unlink(SHM_NAME);
        return -1;
    }

    //获取共享内存文件相关属性信息,这里获取的是文件大小,查看是否是设置的大小
    struct stat filestat;
    fstat(fd, &filestat);
    printf("shm st_size :%ld\n",filestat.st_size);

    sem_t *sem_ptr = (sem_t *)mmap(NULL,sizeof(sem_t),PROT_READ | PROT_WRITE,MAP_SHARED,fd,0);
    if(sem_ptr == NULL)
    {
        perror("mmap failed :");
        return -1;
    }

    printf("sem_ptr : %x\r\n",sem_ptr);

    close(fd);

    //进行信号量的初始化
    ret = sem_init(sem_ptr,1,1);
    if(ret != 0)
    {
        perror("sem_init failed :");
        return -1;
    }
    printf("sem init succ\r\n");

    sem_wait(sem_ptr);
    printf("sem wait succ\r\n");

    sleep(180);

    sem_post(sem_ptr);
    printf("sem post succ\r\n");

    return 0;
}

这个代码使用一块共享内存共享无名信号量,第一次执行程序对信号量进行初始化然后首先wait将信号量-1,随后进入sleepsleep结束之后进行post将信号量+1。在第一次程序运行过程中第二次执行程序,理论上第二次程序应该一直在wait阻塞知道第一次程序完成post

但是实际现象为第二个程序直接wait并且进行后续操作,没有被阻塞

jhliu@test:~/OS_code/semCode/unnamedSem$ ./gdbmySem &
[1] 3975137
jhliu@test:~/OS_code/semCode/unnamedSem$ PID : 3975137
shm st_size :32
sem_ptr : 1c049000
sem init succ
sem wait succ

jhliu@test:~/OS_code/semCode/unnamedSem$ ./gdbmySem &
[2] 3975162
jhliu@test:~/OS_code/semCode/unnamedSem$ PID : 3975162
shm st_size :32
sem_ptr : ad6ca000
sem init succ
sem wait succ

jhliu@test:~/OS_code/semCode/unnamedSem$

问题的原因其实很简单就是在第二次程序运行时也对同一块内存空间的信号量进行了初始化但是这里记录笔者还不知道这个原因进行的调试步骤

调试过程

strace查看系统调用

首先我是用strace命令查看系统调用

# 终端 1:启动第一个进程,跟踪并分文件输出日志
strace -f -o ./proc1.log ./gdbmySem &

# 终端 2:启动第二个进程(等第一个进入 sleep 后再执行)
strace -f -o ./proc2.log ./gdbmySem &

先查看他们是否使用的同一块共享内存

jhliu@test:~/OS_code/semCode/unnamedSem$ cat ./proc1.log | grep  openat
3865113 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3865113 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
3865113 openat(AT_FDCWD, "/dev/shm/wjbtest", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0640) = 3
jhliu@test:~/OS_code/semCode/unnamedSem$ cat ./proc2.log | grep  openat
3865214 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3865214 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
3865214 openat(AT_FDCWD, "/dev/shm/wjbtest", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0640) = 3

可以看到这里两个程序·都是打开的同一个共享内存/dev/shm/wjbtest

接下来我想要查看sem_init和sem_wait的系统调用是什么,查看sem_t的值是否是预期的值

POSIX 信号量在 Linux 上通常基于 futex 实现。sem_init 内部会初始化一个 futex 字,sem_wait/post 会触发 futex 系统调用。

但是实际使用发现strace 日志中没有 futex 记录,询问AI可能原因是:

  1. sem_wait 的“快路径”优化(最核心原因),sem_wait 并不是每次调用都进入内核
  2. sem_init() 在现代 Linux/glibc 中根本不产生任何系统调用

所以这里决定使用gdb进行验证调试

gdb调试程序

将程序编译为调试版本:

gcc -g -O0 -o gdbmySem mySem.cpp 

然后先运行第一次程序,等待它进入sleep后查看堆栈信息,使用gdb -p命令

root@test:/home/jhliu/OS_code/semCode/unnamedSem# ./gdbmySem &
[1] 3914304
root@test:/home/jhliu/OS_code/semCode/unnamedSem# PID : 3914304
shm st_size :32
sem_ptr : 60c3000
sem init succ
sem wait succ
root@test:/home/jhliu/OS_code/semCode/unnamedSem# gdb -p 3914304
Attaching to process 3914304
Reading symbols from /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem...
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
Reading symbols from /usr/lib/debug/.build-id/91/f01b4ad171c80b6303d08d1f08cba8b990413d.debug...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/91/0350ade28cd94b8877b329819847fbc04f9bd9.debug...

(gdb) bt
#0  __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
    req=req@entry=0x7ffd68f38e70, rem=rem@entry=0x7ffd68f38e70)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71
#1  0x0000708d05f01597 in __GI___nanosleep (req=req@entry=0x7ffd68f38e70,
    rem=rem@entry=0x7ffd68f38e70) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x0000708d05f18581 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00005953a3eab4e9 in main (argc=1, argv=0x7ffd68f39098) at mySem.cpp:54
(gdb) frame 3

先使用bt查看堆栈信息,现在在sleep的循环语句中,先切换到main函数内,使用frame 3

在main函数栈中打印sem_ptr信息

#3  0x00005953a3eab4e9 in main (argc=1, argv=0x7ffd68f39098) at mySem.cpp:54
54          sleep(180);
(gdb) print sem_ptr
$1 = (sem_t *) 0x708d060c3000
(gdb) print *(unsigned int *)sem_ptr
$2 = 0

这里可以看到打印出来的sem_ptr值为0,说明wait操作确实将sem_ptr减一了

并且可以查看程序底层共享内存映射

(gdb) info proc mappings
process 3914304
Mapped address spaces:

          Start Addr           End Addr       Size     Offset  Perms  objfile
      0x5953a3eaa000     0x5953a3eab000     0x1000        0x0  r--p /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
      0x5953a3eab000     0x5953a3eac000     0x1000     0x1000  r-xp   /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
      0x5953a3eac000     0x5953a3ead000     0x1000     0x2000  r--p   /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
      0x5953a3ead000     0x5953a3eae000     0x1000     0x2000  r--p   /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
      0x5953a3eae000     0x5953a3eaf000     0x1000     0x3000  rw-p   /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem
      0x5953a8f0f000     0x5953a8f30000    0x21000        0x0  rw-p   [heap]
      0x708d05e00000     0x708d05e28000    0x28000        0x0  r--p   /usr/lib/x86_64-linux-gnu/libc.so.6
      0x708d05e28000     0x708d05fbe000   0x196000    0x28000  r-xp   /usr/lib/x86_64-linux-gnu/libc.so.6
      0x708d05fbe000     0x708d0600d000    0x4f000   0x1be000  r--p   /usr/lib/x86_64-linux-gnu/libc.so.6
      0x708d0600d000     0x708d06011000     0x4000   0x20c000  r--p   /usr/lib/x86_64-linux-gnu/libc.so.6
      0x708d06011000     0x708d06013000     0x2000   0x210000  rw-p   /usr/lib/x86_64-linux-gnu/libc.so.6
      0x708d06013000     0x708d06020000     0xd000        0x0  rw-p
      0x708d060b2000     0x708d060b5000     0x3000        0x0  rw-p
      0x708d060c3000     0x708d060c4000     0x1000        0x0  rw-s   /dev/shm/wjbtest
      0x708d060c4000     0x708d060c6000     0x2000        0x0  rw-p
      0x708d060c6000     0x708d060ca000     0x4000        0x0  r--p   [vvar]
      0x708d060ca000     0x708d060cc000     0x2000        0x0  r-xp   [vdso]
      0x708d060cc000     0x708d060cd000     0x1000        0x0  r--p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
      0x708d060cd000     0x708d060fa000    0x2d000     0x1000  r-xp   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
      0x708d060fa000     0x708d06104000     0xa000    0x2e000  r--p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
      0x708d06104000     0x708d06106000     0x2000    0x38000  r--p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
      0x708d06106000     0x708d06108000     0x2000    0x3a000  rw-p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
      0x7ffd68f1b000     0x7ffd68f3c000    0x21000        0x0  rw-p   [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]

紧接着直接使用gdb运行第二次程序,在程序中增加断点,主要是sem_initsem_wait两个地方,用于查看初始化之前和之后两次sem_ptr的值

root@test:/home/jhliu/OS_code/semCode/unnamedSem# gdb ./gdbmySem
GNU gdb (Ubuntu 15.1-1ubuntu2) 15.1 
Reading symbols from ./gdbmySem...
(gdb) set breakpoint pending on
(gdb) break sem_init
Breakpoint 1 at 0x11c0
(gdb) break sem_wait
Breakpoint 2 at 0x1130

然后运行程序,程序会在断点处停下,这时候我们首先查看init之前的值:

(gdb) run
Starting program: /home/jhliu/OS_code/semCode/unnamedSem/gdbmySem 
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
PID : 3930267
shm st_size :32
sem_ptr : f7fba000

Breakpoint 1, __new_sem_init (sem=0x7ffff7fba000, pshared=1, value=1)
    at ./nptl/sem_init.c:28
warning: 28     ./nptl/sem_init.c: No such file or directory
(gdb) bt
#0  __new_sem_init (sem=0x7ffff7fba000, pshared=1, value=1) at ./nptl/sem_init.c:28
#1  0x000055555555548d in main (argc=1, argv=0x7fffffffdf88) at mySem.cpp:43
(gdb) frame 1
#1  0x000055555555548d in main (argc=1, argv=0x7fffffffdf88) at mySem.cpp:43
43          ret = sem_init(sem_ptr,1,1);
(gdb) print sem_ptr
$1 = (sem_t *) 0x7ffff7fba000
(gdb) print *(unsigned int *)0x7ffff7fba000
$2 = 0 

在这里init之前看到sem_ptr的值是0,符合前面第一次程序wait之后将值减一

然后我们继续运行程序

(gdb) frame 0
#0  __new_sem_init (sem=0x7ffff7fba000, pshared=1, value=1) at ./nptl/sem_init.c:28
warning: 28     ./nptl/sem_init.c: No such file or directory
(gdb) continue
Continuing.
sem init succ

Breakpoint 2, __new_sem_wait (sem=0x7ffff7fba000) at ./nptl/sem_wait.c:25
warning: 25     ./nptl/sem_wait.c: No such file or directory
(gdb) bt
#0  __new_sem_wait (sem=0x7ffff7fba000) at ./nptl/sem_wait.c:25
#1  0x00005555555554d0 in main (argc=1, argv=0x7fffffffdf88) at mySem.cpp:51
(gdb) print *(unsigned int *)0x7ffff7fba000
$3 = 1
(gdb) continue
Continuing.
sem wait succ
sem post succ
sem post succ
[Inferior 1 (process 3930267) exited normally]

程序init之后来到第二个断点wait之前,在这里继续打印出sem_ptr的值,发现这里变成了1,而后继续运行程序就能直接wait进行后续操作

这里变量的虚拟地址在一个程序中是固定的,所以直接打印sem_ptr的虚拟地址

到这里笔者查阅了sem_init函数的定义说明:Initializing a semaphore that has already been initialized results in undefined behavior.,反复调用确实重置sem_t的值

上一篇
下一篇