背景:
从测试同学那边拿过来两台黑屏的机器,点击电源、屏幕都没有反应,不过还好adb可以链接,有root权限,一番deubgging后,确认表面原因是system_server进程的art虚拟机卡在了gc前flip线程状态的操作中,而最终的root cause也比较有意思,所以本文就简单记录下问题的排查过程。
1. 按照习惯,快速检查下是否发生了watchdog
1 | $ adb shell ``ls` `-l ``/data/anr/ |
并没有watchdog相关日志
2. 查看系统当前时间
1 | `$ adb shell ``date``Sun May 5 13:52:25 CST 2019` |
3. 查看system_server进程各个线程的状态
1 | $ pid=`adb shell pidof system_server` && adb shell ``ps` `-T $pid |
这一步主要是快速检查下有没有D(disk blocked)状态的线程
3. 查看system_server进程backtrace
1 | $ pid=`adb shell pidof system_server` && adb shell ``kill` `-3 $pid |
得到:
点击展开完整backtrace!
1 | ----- pid 1471 at 2019-05-05 11:07:04 ----- |
检查各个线程在虚拟机内部的状态后,看到一个比较奇怪的现象:
1 | "Binder:filter-perf-event"` `prio=``5` `tid=``4` `WaitingForGcThreadFlip |
一堆线程都是gc相关的状态,具体什么意思,我也不关心,还是再继续看下blocked的线程,都具体卡在什么地方:
1 | "Binder:1471_6" prio=5 tid=106 Blocked |
接着看 0x05e68352 和 0x00c6c1f5 这俩monitor锁的持有线程:
1 | "Binder:1471_20" prio=5 tid=146 WaitingForGcThreadFlip |
可以看到还是owner线程正是之前 WaitingForGcThreadFlip 中的一员,那么下一步通过gdb看下这一类线程具体卡在哪一行代码上:
- 根据机器的版本号确认symbol文件后,通过公司同事分享的脚本一行命令就可以attach到system_server进程上
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62$ ~/Documents/gdb_native_tools/droid.py attach /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols system_server
process: system_server, pid(1471), parent(zygote64)
gdbserver:/data/local/tmp/gdbserver64
bin_path: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/
solib: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/
bin: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64
Attached; pid = 1471
gdbserver: Unable to determine the number of hardware watchpoints available.
gdbserver: Unable to determine the number of hardware breakpoints available.
Listening on port 1234
GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64...done.
Remote debugging using localhost:1234
Remote debugging from host 127.0.0.1
warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libandroid_runtime.so" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libcutils.so" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libaudioclient.so" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libcutils.so" is not at the expected address (wrong library or version mismatch?)
warning: Could not load shared library symbols for 15 libraries, e.g. /system/lib64/libclang_rt.ubsan_standalone-aarch64-android.so.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64...done.
Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/linker64...done.
......
syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
41 bionic/libc/arch-arm64/bionic/syscall.S: No such file or directory.
Source directories searched: /mnt/miui/v10-p-dipper-alpha:$cdir:$cwd
#0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
#1 0x0000007238a80560 in art::futex (uaddr=0x723961df68, op=0, val=20935915, timeout=0x7fc723ee90, uaddr2=0x7fc723ee90, val3=0) at art/runtime/base/mutex-inl.h:43
#2 art::ConditionVariable::WaitHoldingLocks (this=0x723961df58, self=0x7239614c00) at art/runtime/base/mutex.cc:953
#3 0x0000007238d6241c in art::Monitor::Lock<(art::LockReason)1> (this=0x723961df20, self=0x7239614c00) at art/runtime/monitor.cc:559
#4 0x0000007238d687a8 in art::Monitor::MonitorEnter (self=0x7239614c00, obj=<optimized out>, trylock=<optimized out>) at art/runtime/monitor.cc:1462
#5 0x0000007238ec40fc in art::mirror::Object::MonitorEnter (this=0x62 <_DYNAMIC+98>, self=0x7239614c00) at art/runtime/mirror/object-inl.h:98
#6 artLockObjectFromCode (obj=0x62 <_DYNAMIC+98>, self=0x7239614c00) at art/runtime/entrypoints/quick/quick_lock_entrypoints.cc:32
#7 0x0000007238f04e9c in art_quick_lock_object () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1510
#8 0x00000072207cf6a4 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
/home/pip/Documents/gdb_native_tools/gdb-script/init.gdb: No such file or directory.
/home/pip/Documents/gdb_native_tools/shadow/gdb_driver.py: No such file or directory.
(gdb) handle SIG35 nostop noprint pass
Signal Stop Print Pass to program Description
SIG35 No No Yes Real-time event 35
(gdb) handle SIG33 nostop noprint pass
Signal Stop Print Pass to program Description
SIG33 No No Yes Real-time event 33
tips:handle SIG35 nostop noprint pass 是为了不处理android平台上用于抓native trace的 sig35 信号,减小我们的调试过程意外中断的可能性(只是个习惯, 这个问题中其实不做也没关系)
- 切换到gc线程
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39(gdb) thread 9
[Switching to thread 9 (Thread 1471.1484)]
#0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
41 svc #0
(gdb) bt
#0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
#1 0x0000007238a80560 in art::futex (uaddr=0x7239723910, op=0, val=93152, timeout=0x7221a89ab0, uaddr2=0x7221a89ab0, val3=0) at art/runtime/base/mutex-inl.h:43
#2 art::ConditionVariable::WaitHoldingLocks (this=0x7239723900, self=0x723966b000) at art/runtime/base/mutex.cc:953
#3 0x0000007238b98d54 in art::gc::Heap::ThreadFlipBegin (this=0x72396b7600, self=0x723966b000) at art/runtime/gc/heap.cc:845 // <<<<<<<<<<<<<<<<<<<<<<<
#4 0x0000007238e52204 in art::ThreadList::FlipThreadRoots (this=<optimized out>, thread_flip_visitor=0x7221a89ca0, flip_callback=<optimized out>, collector=
0x7239716500, pause_listener=<optimized out>) at art/runtime/thread_list.cc:591
#5 0x0000007238b65cc4 in art::gc::collector::ConcurrentCopying::FlipThreadRoots (this=0x7239716500) at art/runtime/gc/collector/concurrent_copying.cc:620
#6 0x0000007238b64c50 in art::gc::collector::ConcurrentCopying::RunPhases (this=0x7239716500) at art/runtime/gc/collector/concurrent_copying.cc:178
#7 0x0000007238b7af80 in art::gc::collector::GarbageCollector::Run (this=0x7239716500, gc_cause=<optimized out>, clear_soft_references=<optimized out>)
at art/runtime/gc/collector/garbage_collector.cc:96
#8 0x0000007238b9d93c in art::gc::Heap::CollectGarbageInternal (this=0x72396b7600, gc_type=art::gc::collector::kGcTypeFull, gc_cause=art::gc::kGcCauseBackground,
clear_soft_references=false) at art/runtime/gc/heap.cc:2616
#9 0x0000007238baeeb8 in art::gc::Heap::ConcurrentGC (this=0x72396b7600, self=<optimized out>, cause=art::gc::kGcCauseBackground, force_full=false)
at art/runtime/gc/heap.cc:3621
#10 0x0000007238bb4524 in art::gc::Heap::ConcurrentGCTask::Run (this=<optimized out>, self=0x0 <_DYNAMIC>) at art/runtime/gc/heap.cc:3582
#11 0x0000007238bd692c in art::gc::TaskProcessor::RunAllTasks (this=0x72396bb100, self=0x723966b000) at art/runtime/gc/task_processor.cc:129
#12 0x0000000072a7e3f0 in dalvik.system.VMRuntime.clampGrowthLimit [DEDUPED] ()
from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat
#13 0x0000000072c20bd8 in java.lang.Daemons$HeapTaskDaemon.runInternal ()
from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat
#14 0x0000000072a80c40 in java.lang.Daemons$Daemon.run ()
from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat
#15 0x0000000071baa4dc in java.lang.Thread.run ()
from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-oj.oat
#16 0x0000007238f0498c in art_quick_invoke_stub () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1702
#17 0x0000007238a78524 in art::ArtMethod::Invoke (this=0x6ff68ad8, self=0x723966b000, args=<optimized out>, args_size=<optimized out>, result=0x7221a8a310,
shorty=0x7269daa9 "V") at art/runtime/art_method.cc:374
#18 0x0000007238e07258 in art::(anonymous namespace)::InvokeWithArgArray (soa=..., method=0x6ff68ad8, arg_array=0x7221a8a318, result=<optimized out>,
shorty=0x7269daa9 "V") at art/runtime/reflection.cc:456
#19 0x0000007238e08314 in art::InvokeVirtualOrInterfaceWithJValues (soa=..., obj=0x5 <_DYNAMIC+5>, mid=<optimized out>, args=0x0 <_DYNAMIC>)
at art/runtime/reflection.cc:548
#20 0x0000007238e36764 in art::Thread::CreateCallback (arg=0x723966b000) at art/runtime/thread.cc:473
#21 0x00000072bb501db0 in __pthread_start (arg=0x7221a8a4f0) at bionic/libc/bionic/pthread_create.cpp:254
#22 0x00000072bb4a378c in __start_thread (fn=0x72bb501d88 <__pthread_start(void*)>, arg=0x7221a8a4f0) at bionic/libc/bionic/clone.cpp:52
结合代码: art/runtime/gc/heap.cc#845
1 | void Heap::ThreadFlipBegin(Thread* self) { |
既然在等待一个condition,那么 disable_thread_flip_count_ 肯定是大于0的,我们可以打印一下这个变量的值:
1 | (gdb) print disable_thread_flip_count_ |
果然它是1,正常情况下,当前线程在thread_flip_cond_这个条件上wait一段时间后,肯定会有一个线程会去notify它,来唤醒处于wait状态的线程;
- 那么,接下来自然是快速过一下源码,看看都有哪些函数会去操作thread_flip_cond_:
1 | void Heap::ThreadFlipEnd(Thread* self) { |
再看一下各个线程的backtrace,确认了,没有哪个在调用上面两个任意一个
- 那些处于WaitingForGcThreadFlip的线程block在哪?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16(gdb) t 23
[Switching to thread 23 (Thread 1471.1692)]
#0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
41 svc #0
(gdb) bt
#0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
#1 0x0000007238a80560 in art::futex (uaddr=0x7239723910, op=0, val=93152, timeout=0x721cbe2850, uaddr2=0x721cbe2850, val3=0) at art/runtime/base/mutex-inl.h:43
#2 art::ConditionVariable::WaitHoldingLocks (this=0x7239723900, self=0x721d26a000) at art/runtime/base/mutex.cc:953
#3 0x0000007238b98684 in art::gc::Heap::IncrementDisableThreadFlip (this=0x72396b7600, self=0x721d26a000) at art/runtime/gc/heap.cc:798
#4 0x0000007238d2cfe8 in art::JNI::GetStringCritical (env=<optimized out>, java_string=<optimized out>, is_copy=0x0 <_DYNAMIC>) at art/runtime/jni_internal.cc:1887
#5 0x00000072bae3177c in android::sp<android::IGraphicBufferProducer>::~sp (this=0x0 <_DYNAMIC>) at system/core/libutils/include/utils/StrongPointer.h:156
#6 android::nativeCreateScoped (env=<optimized out>, clazz=<optimized out>, surfaceObject=<optimized out>)
at frameworks/base/core/jni/android_view_SurfaceSession.cpp:51
#7 0x00000000748c94ac in android.filterfw.format.ImageFormat.create ()
from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-framework.oat
#8 0x0000000000000000 in ?? () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64
art/runtime/gc/heap.cc:798 对应的代码:
1 | void Heap::IncrementDisableThreadFlip(Thread* self) { |
所以到这儿,我们清楚的一点是,肯定是由于某个线程没有正常调用 DecrementDisableThreadFlip() 函数,导致 disable_thread_flip_count_ 一直不为0;
- disable_thread_flip_count_ 一直不为 0 这个锅应该是谁的?
仔细看一下上面这个函数的代码,不仅全局有个disable_thread_flip_count_, 同时每个线程还有thread local的 disable_thread_flip_count,每次调用IncrementDisableThreadFlip()开始的时候,都会先执行Thread::Current()->IncrementDisableThreadFlipCount(),看下它的代码:1
2
3
4
5// art/runtime/thread.h
uint32_t GetDisableThreadFlipCount() const {
CHECK(kUseReadBarrier);
return tls32_.disable_thread_flip_count;
}
YEAH!下面只需要逐个遍历art里面每个线程,把 tls32_.disable_thread_flip_count > 1的部分打印出来:
稍微修改一下gdb脚本下的 art.gdb脚本里的art_dump_thread_list_base函数:
1 | define art_dump_thread_list |
然后重新导入脚本然后执行这个函数即可:
1 | (gdb) source /home/pip/Documents/gdb_native_tools/art.gdb |
最后那个 AsyncTask #36709 十分可疑,别的线程的disable_flip_count 为1的原因是正在调用 IncrementDisableThreadFlip()函数,OK,接着看下它在执行什么函数:
- 嫌疑人 “AsyncTask #36709” 干了些什么?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25"AsyncTask #36709" prio=5 tid=180 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x14d149c8 self=0x7211c94400
| sysTid=4738 nice=0 cgrp=default sched=0/0 handle=0x708411b4f0
| state=S schedstat=( 3931924 16620626 18 ) utm=0 stm=0 core=7 HZ=100
| stack=0x7084018000-0x708401a000 stackSize=1041KB
| held mutexes=
kernel: __switch_to+0xa4/0xfc
kernel: binder_thread_read+0xa78/0x11a0
kernel: binder_ioctl+0x920/0xb08
kernel: do_vfs_ioctl+0xb8/0x8d8
kernel: SyS_ioctl+0x84/0x98
kernel: __sys_trace+0x4c/0x4c
native: #00 pc 000000000006e5bc /system/lib64/libc.so (__ioctl+4)
native: #01 pc 0000000000029544 /system/lib64/libc.so (ioctl+136)
native: #02 pc 000000000001e22c /system/lib64/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240)
native: #03 pc 000000000001e884 /system/lib64/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+324)
native: #04 pc 0000000000012204 /system/lib64/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312)
native: #05 pc 00000000000c9764 /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssXtra::_hidl_injectXtraData(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::hidl_string const&)+244)
native: #06 pc 000000000005d5c4 /system/lib64/libandroid_servers.so (android::android_location_GnssLocationProvider_inject_xtra_data(_JNIEnv*, _jobject*, _jbyteArray*, int)+268)
at com.android.server.location.GnssLocationProvider.native_inject_xtra_data(Native method)
at com.android.server.location.GnssLocationProvider.access$2300(GnssLocationProvider.java:112)
at com.android.server.location.GnssLocationProvider$10.run(GnssLocationProvider.java:1140)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:764)
它在一个jni里执行一个hw binder call,结合下代码:
1 | // frameworks/base/services/core/jni/com_android_server_location_GnssLocationProvider.cpp#1613 |
(不熟悉art实现)一眼看过去,都是正常的jni调用,好像跟disable_flip_count没啥关系呀?全局搜一下DecrementDisableThreadFlip()和IncrementDisableThreadFlip被调用的地方:
1 | ReleaseStringCritical() → DecrementDisableThreadFlip() |
到这个地方,稍稍消化总结一下,不难推断出:
这个问题必然是因为 AsyncTask #36709 在执行jni调用过程中,它执行了:
GetPrimitiveArrayCritical() → IncrementDisableThreadFlip() → 使得全局变量 disable_thread_flip_count_ = 1
但是在继续执行 gnssXtraIface →injectXtraData() 这个 hal 接口的时候,一直被block,自然无法执行 ReleasePrimitiveArrayCritical() → DecrementDisableThreadFlip() → disable_thread_flip_count_ = 0
4. gnssXtraIface→injectXtraData() 为什么会一直被block?
查看binder transactin日志
1 | $ adb shell cat /sys/kernel/debug/binder/transactions |
incoming transaction from 1471:4738 to 1007:0,得到对端进程pid为1007, 继续看下这个进程里各个线程的状态
1 | $ adb shell ps -T 1007 |
各个线程看起来都是正常的,通过debuggerd查看backtrace
1 | $ adb shell debuggerd -b 1007 |
结合上面的binder call日志,可以看到ignss_1_1进程内部的两个binder线程都在回调system_server进程的binder 接口,没有空闲binder线程可以处理来自system_server的gnssXtraIface→injectXtraData() 请求!
incoming 1471:1673
outgoing 1007:1007 to 1471:1673 // blocked !
outgoing 1007:2040 to 1471:8981 // WaitingForGcThreadFlip !
p.s. 1471:1673 并不是一个binder线程, 但确响应了来自1007:1007的binder call,这是binder api设计上面的优化,复用? todo:确认
1 | "android.bg" prio=5 tid=13 Blocked |
5. 修复方案
所以这个问题也是system_server↔ ignss_1_1 这两个进程相互binder call + jni call + 虚拟机正好在执行垃圾回收;
参考 oracle的文档:
After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical. We must treat the code inside this pair of functions as running in a “critical region.” Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)
These restrictions make it more likely that the native code will obtain an uncopied version of the array, even if the VM does not support pinning. For example, a VM may temporarily disable garbage collection when the native code is holding a pointer to an array obtained via GetPrimitiveArrayCritical.
GetPrimitiveArrayCritical 和 ReleasePrimitiveArrayCritical 两个函数间的代码执行耗时不应太久! 结合此处的业务逻辑,而且也不会被太频繁的调用,拷贝一份原始数组给binder call作为参数即可:
1 | static void android_location_GnssLocationProvider_inject_xtra_data(JNIEnv* env, jobject /* obj */, |