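This post briefly describes how to analyze a deadlock in an Android native service, and explains how bionic and glibc differ in their mutex implementations.

When contention on a mutex that protects a resource in an Android native service leads to a deadlock, we can attach gdb to the native service and inspect the mutex state to find the deadlock chain.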

Procedure

1. Find the native service pid

Assume the native service's process name is test. Run:

ps -A | grep test

The output shows that the pid is 359:

root 359 1 226924 9496 binder_thread_read 0 S

2. Attach gdb to the test process

gdb -p 359 

3. Run bt on all threads

After attaching, the process is paused. The following command prints the call stack of every thread in the test process:

(gdb) thread apply all bt

Inspect each call stack for threads that are waiting on a mutex. For example:

#0  0xe9f0d22c in syscall () from /apex/com.android.runtime/lib/bionic/libc.so
(gdb) bt
#0  0xe9f0d22c in syscall () from /apex/com.android.runtime/lib/bionic/libc.so
#1  0xe9f12392 in __futex_wait_ex(void volatile*, bool, int, bool, timespec const*) () from /apex/com.android.runtime/lib/bionic/libc.so
#2  0xe9f5ad12 in NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*) () from /apex/com.android.runtime/lib/bionic/libc.so
....
#11 0xe751e404 in TaskWrapFunc (arg=0xe82a0018 <g_stTask+56>)
#12 0xe9f5a12c in __pthread_start(void*) ()
   from /apex/com.android.runtime/lib/bionic/libc.so
#13 0xe9f12fee in __start_thread ()
   from /apex/com.android.runtime/lib/bionic/libc.so
(gdb)

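Now we know that the thread running TaskWrapFunc is blocked because it cannot acquire the mutex. The next step is to find out which thread holds that mutex.

4. Load a libc with symbols

The libc on the Android board is stripped of symbols, so we cannot inspect the arguments and variables in each frame. Copy out\target\product\xxx\symbols\apex\com.android.runtime\ from the build directory to the /data partition of the Android board, keeping the directory structure as /data/symbols/apex/com.android.runtime/, then point gdb at the libc with symbols: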

(gdb) set solib-absolute-prefix /data/symbols
(gdb) set solib-search-path /data/symbols

5. Inspect the mutex owner

Running bt again now shows full information for every frame:

(gdb) bt
#0  syscall () at bionic/libc/arch-arm/bionic/syscall.S:44
#1  0xe9f12392 in __futex (ftx=0xe98989f0, op=137, value=301301762, 
    timeout=0x0, bitset=-1) at bionic/libc/private/bionic_futex.h:45
#2  FutexWithTimeout (ftx=<optimized out>, op=137, value=<optimized out>, 
    use_realtime_clock=<optimized out>, abs_timeout=<optimized out>, 
    bitset=-1) at bionic/libc/bionic/bionic_futex.cpp:58
#3  __futex_wait_ex (ftx=0xe98989f0, shared=<optimized out>, value=301301762, 
    use_realtime_clock=<optimized out>, abs_timeout=0x0)
    at bionic/libc/bionic/bionic_futex.cpp:63
#4  0xe9f5ad12 in NonPI::RecursiveOrErrorcheckMutexWait (mutex=0xe98989f0, 
    shared=0, old_state=<optimized out>, use_realtime_clock=false, 
    abs_timeout=0x0) at bionic/libc/bionic/pthread_mutex.cpp:705
#5  NonPI::MutexLockWithTimeout (mutex=0xe98989f0, use_realtime_clock=false, 
    abs_timeout_or_null=0x0) at bionic/libc/bionic/pthread_mutex.cpp:784
#6  0xe751c94c in TD_OS_MutexLock (pMutex=0xe98989f0)
...
#14 0xe751e404 in TaskWrapFunc (arg=0xe82a0018 <g_stTask+56>)
    at /home/cd00010/vestel_mp_idtv/vestel_n33007_mb250_mp_idtv/TV/code/platform/src/system/os/td_os.c:2827
#15 0xe9f5a12c in __pthread_start (arg=0xea0121c0)
    at bionic/libc/bionic/pthread_create.cpp:347
#16 0xe9f12fee in __start_thread (fn=0xe9f5a103 <__pthread_start(void*)>, 
    arg=<optimized out>) at bionic/libc/bionic/clone.cpp:53

Run the following commands to switch to frame 4 and print the mutex it is waiting on. The owner is the thread whose tid is 4597:

(gdb) f 4
#4  0xe9f5ad12 in NonPI::RecursiveOrErrorcheckMutexWait (mutex=0xe98989f0, 
    shared=0, old_state=<optimized out>, use_realtime_clock=false, 
    abs_timeout=0x0) at bionic/libc/bionic/pthread_mutex.cpp:705
705     bionic/libc/bionic/pthread_mutex.cpp: No such file or directory.
(gdb) p *mutex
$3 = {state = 32770, {owner_tid = 4597, pi_mutex_id = 4597}}
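The state value printed above can also be decoded by hand. The following is a rough sketch, assuming the bit layout described in the comments of bionic's pthread_mutex.cpp (an assumption worth checking against your own source tree). The names decoded_state, decode_state and split_futex_word are hypothetical helpers for illustration, not bionic APIs. The second helper relies on an observation from this 32-bit trace: the futex value in frame #1 appears to pack state in the low 16 bits and owner_tid in the high 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the 16-bit state field (verify against your bionic tree):
 *   bits 0-1   lock state (0 unlocked, 1 locked, 2 locked with possible waiters)
 *   bits 2-12  recursion counter
 *   bit  13    process-shared flag
 *   bits 14-15 mutex type (0 normal, 1 recursive, 2 error check)
 */
typedef struct {
    unsigned lock_state;
    unsigned counter;
    unsigned shared;
    unsigned type;
} decoded_state; /* hypothetical helper type, not part of bionic */

/* Split the state field into its assumed bit fields. */
decoded_state decode_state(uint16_t state) {
    decoded_state d;
    d.lock_state = state & 0x3u;
    d.counter = (state >> 2) & 0x7FFu;
    d.shared = (state >> 13) & 0x1u;
    d.type = (state >> 14) & 0x3u;
    return d;
}

/* Split the 32-bit futex word seen in frame #1 of a 32-bit process:
 * low half is state, high half is owner_tid (assumption, see lead-in). */
void split_futex_word(uint32_t word, uint16_t *state, uint16_t *owner_tid) {
    assert(state != NULL && owner_tid != NULL);
    *state = (uint16_t)(word & 0xFFFFu);
    *owner_tid = (uint16_t)(word >> 16);
}
```

With the numbers from this trace, split_futex_word(301301762, ...) yields state 32770 and owner 4597, and decode_state(32770) reports type 2 (error check) with lock state 2 (locked, possibly with waiters). That agrees with the p *mutex output.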

6. Find the owner thread (tid=4597) in the earlier all-thread output

Thread 70 (LWP 4597):

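Switch to the thread that holds the mutex and check its flow with bt. It has been sleeping ever since it took the mutex. Combined with the source code, this is enough to find the root cause.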

(gdb) t 70
[Switching to thread 70 (LWP 4597)]
#0  nanosleep ()
    at out/soong/.intermediates/bionic/libc/syscalls-arm.S/gen/syscalls-arm.S:1822
1822    out/soong/.intermediates/bionic/libc/syscalls-arm.S/gen/syscalls-arm.S: No such file or directory.

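Troubleshooting

If the owner_tid seen in step 5 is 0, the reason is that Android's bionic libc is implemented differently from glibc: bionic records owner_tid only when the mutex type is not PTHREAD_MUTEX_NORMAL.

Solution

Set the mutex type to PTHREAD_MUTEX_ERRORCHECK when initializing the mutex: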

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init( pstMtx, &attr );
    pthread_mutexattr_destroy(&attr);

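Flow analysis

Android's pthread implementation lives in bionic/libc/bionic/. Locking a mutex ends up in MutexLockWithTimeout. If no type has been set, the mutex defaults to NORMAL, and the function locks it directly and returns: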

if ( __predict_true(mtype == MUTEX_TYPE_BITS_NORMAL) ) {
    return NormalMutexLock(mutex, shared, use_realtime_clock, abs_timeout_or_null);
}

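When a type has been set, the lock path first checks whether the calling thread already owns the mutex. If the type is error check, recursive locking is not supported and EDEADLK is returned. Otherwise the recursion count is incremented: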

pid_t tid = __get_thread()->tid;
if (tid == atomic_load_explicit(&mutex->owner_tid, memory_order_relaxed)) {
    if (mtype == MUTEX_TYPE_BITS_ERRORCHECK) {
        return EDEADLK;
    }
    return RecursiveIncrement(mutex, old_state);
}
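The difference between the two branches above can be observed from user code through the standard POSIX API. This is a minimal sketch, not taken from the original post: relock_result is a hypothetical helper that locks a mutex of the given type twice from the same thread and returns the result of the second lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: lock a mutex of the given type twice from the same
 * thread and return the result of the second pthread_mutex_lock call.
 * Only error-check and recursive types are accepted, because relocking a
 * NORMAL mutex from its owner would block forever. */
int relock_result(int type) {
    pthread_mutexattr_t attr;
    pthread_mutex_t mtx;
    int rc;

    assert(type == PTHREAD_MUTEX_ERRORCHECK || type == PTHREAD_MUTEX_RECURSIVE);

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, type);
    pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mtx);       /* first lock succeeds */
    rc = pthread_mutex_lock(&mtx);  /* same thread locks again */
    if (rc == 0) {
        pthread_mutex_unlock(&mtx); /* recursive case: undo the second lock */
    }
    pthread_mutex_unlock(&mtx);     /* release the first lock */
    pthread_mutex_destroy(&mtx);
    return rc;
}
```

Calling relock_result(PTHREAD_MUTEX_ERRORCHECK) should return EDEADLK, while relock_result(PTHREAD_MUTEX_RECURSIVE) should return 0, which mirrors the EDEADLK and RecursiveIncrement branches.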

The owner_tid is saved when the lock is acquired:

if (old_state == unlocked) {
    // If exchanged successfully, an acquire fence is required to make
    // all memory accesses made by other threads visible to the current CPU.
    if (__predict_true(atomic_compare_exchange_strong_explicit(&mutex->state, &old_state,
                         locked_uncontended, memory_order_acquire, memory_order_relaxed))) {
        atomic_store_explicit(&mutex->owner_tid, tid, memory_order_relaxed);
        return 0;
    }
}
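To make the pattern above easier to experiment with, here is a toy sketch of a compare-and-swap lock that records its owner. It is deliberately much simpler than bionic's pthread_mutex_internal_t: there is no waiting, no recursion counter and no futex. All names (toy_mutex, toy_init, toy_trylock, toy_unlock) are hypothetical, and the tid is passed in by the caller rather than read from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

enum { TOY_UNLOCKED = 0, TOY_LOCKED = 1 };

typedef struct {
    _Atomic int state;
    _Atomic int owner_tid;
} toy_mutex; /* hypothetical type for illustration only */

void toy_init(toy_mutex *m) {
    assert(m != NULL);
    atomic_init(&m->state, TOY_UNLOCKED);
    atomic_init(&m->owner_tid, 0);
}

/* Try to take the lock. On success, record the caller's tid as owner. */
int toy_trylock(toy_mutex *m, int tid) {
    int expected = TOY_UNLOCKED;
    assert(m != NULL);
    if (atomic_compare_exchange_strong_explicit(&m->state, &expected, TOY_LOCKED,
                                                memory_order_acquire,
                                                memory_order_relaxed)) {
        /* The owner is stored after the state change, as in the snippet above. */
        atomic_store_explicit(&m->owner_tid, tid, memory_order_relaxed);
        return 0;
    }
    return EBUSY;
}

/* Release the lock. Only the recorded owner may unlock. */
int toy_unlock(toy_mutex *m, int tid) {
    assert(m != NULL);
    if (atomic_load_explicit(&m->owner_tid, memory_order_relaxed) != tid) {
        return EPERM;
    }
    atomic_store_explicit(&m->owner_tid, 0, memory_order_relaxed);
    atomic_store_explicit(&m->state, TOY_UNLOCKED, memory_order_release);
    return 0;
}
```

Because the owner is written only after the state change succeeds, a debugger that stops the process between the two stores could briefly see a locked mutex with a stale owner. The same caveat is worth keeping in mind when reading owner_tid in gdb.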

Error-check and recursive mutexes then wait for the lock with:

if (RecursiveOrErrorcheckMutexWait(mutex, shared, old_state, use_realtime_clock,
                                   abs_timeout_or_null) == -ETIMEDOUT) {
    return ETIMEDOUT;
}
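From user code, the ETIMEDOUT path is reached through pthread_mutex_timedlock. The sketch below is an illustration under stated assumptions, not code from the original service: probe_held_mutex is a hypothetical helper that starts a thread which takes an error-check mutex and keeps holding it, then tries to lock the same mutex with a deadline. A bounded wait like this can be useful while debugging, since the caller gets an error code back instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

static pthread_mutex_t g_lock;
static sem_t g_locked;  /* posted once the holder owns the mutex */
static sem_t g_release; /* posted when the holder may let go */

static void *holder(void *arg) {
    (void)arg;
    pthread_mutex_lock(&g_lock);
    sem_post(&g_locked);
    sem_wait(&g_release); /* stands in for the long sleep of the real owner */
    pthread_mutex_unlock(&g_lock);
    return NULL;
}

/* Hypothetical helper: returns the result of pthread_mutex_timedlock on a
 * mutex that another thread is holding for the whole duration of the wait. */
int probe_held_mutex(long timeout_ms) {
    pthread_mutexattr_t attr;
    pthread_t t;
    struct timespec deadline;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&g_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    sem_init(&g_locked, 0, 0);
    sem_init(&g_release, 0, 0);

    rc = pthread_create(&t, NULL, holder, NULL);
    assert(rc == 0);
    sem_wait(&g_locked); /* the holder now owns the mutex */

    /* pthread_mutex_timedlock takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    rc = pthread_mutex_timedlock(&g_lock, &deadline);
    if (rc == 0) {
        pthread_mutex_unlock(&g_lock);
    }

    sem_post(&g_release); /* let the holder finish */
    pthread_join(t, NULL);
    pthread_mutex_destroy(&g_lock);
    sem_destroy(&g_locked);
    sem_destroy(&g_release);
    return rc;
}
```

Since the holder does not release the mutex until after the timed wait has returned, probe_held_mutex(50) is expected to return ETIMEDOUT.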

References

http://aospxref.com/android-12.0.0_r3/xref/bionic/libc/bionic/pthread_mutex.cpp