Tracking down a XNU printf() precision modifier panic bug
Background
I’m a maintainer of the kernel extension in a project of our company, at the development stage, I noticed that the kernel extension can panic occasionally.
In this article, I’ll analyze the kernel panic and find its root cause. Hopefully, as a kernel extension writer, give you hints of HOWTO analyze a kernel panic report.
Kernel panic info
The original kernel panic report is missing, I wrote a kernel panic validation program to verify the kernel bug.
System info
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G65
$ uname -a
Darwin lynnl.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
Anonymous UUID: ACB34763-3BB9-ECBA-9615-D786D1C65C95
Tue Feb 12 20:23:26 2019
*** Panic Report ***
panic(cpu 1 caller 0xffffff801f78776f): Kernel trap at 0xffffff801f746340, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0xffffff8031990000, CR3: 0x000000002256f000, CR4: 0x00000000003606e0
RAX: 0xffffff8031990000, RBX: 0xffffff80b771b99a, RCX: 0x0000000000000000, RDX: 0x0000000000000030
RSP: 0xffffff80b771b860, RBP: 0xffffff80b771b880, RSI: 0xffffff803198f880, RDI: 0xffffff80b771b990
R8: 0xffffff80b771ba90, R9: 0x0000000000000008, R10: 0x000000000000000a, R11: 0x0000000000000000
R12: 0xffffff80b771b990, R13: 0xffffff7fa1c27fee, R14: 0xffffff803198f880, R15: 0x000000000000000b
RFL: 0x0000000000010286, RIP: 0xffffff801f746340, CS: 0x0000000000000008, SS: 0x0000000000000010
Fault CR2: 0xffffff8031990000, Error code: 0x0000000000000000, Fault CPU: 0x1 VMM, PL: 0, VF: 1
Backtrace (CPU 1), Frame : Return Address
0xffffff80b771b330 : 0xffffff801f66c1c6
0xffffff80b771b380 : 0xffffff801f795274
0xffffff80b771b3c0 : 0xffffff801f787544
0xffffff80b771b430 : 0xffffff801f61e1e0
0xffffff80b771b450 : 0xffffff801f66bc3c
0xffffff80b771b580 : 0xffffff801f66b9fc
0xffffff80b771b5e0 : 0xffffff801f78776f
0xffffff80b771b750 : 0xffffff801f61e1e0
0xffffff80b771b770 : 0xffffff801f746340
0xffffff80b771b880 : 0xffffff801fc40502
0xffffff80b771b8c0 : 0xffffff801fc3faac
0xffffff80b771bcd0 : 0xffffff801f680c61
0xffffff80b771bec0 : 0xffffff801f67fcf5
0xffffff80b771bf20 : 0xffffff7fa1c27d5b
0xffffff80b771bf60 : 0xffffff7fa1c27ac2
0xffffff80b771bfa0 : 0xffffff801f61d557
Kernel Extensions in backtrace:
cn.junkman.kext.xnu_printf_prec_panic(0.0.1)[F2B9A806-51AA-3ADE-903C-1A8DD67EE4A6]@0xffffff7fa1c27000->0xffffff7fa1c29fff
BSD process name corresponding to current thread: kernel_task
Boot args: -v
Mac OS version:
17G65
Kernel version:
Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64
Kernel UUID: 1AE5ACFD-3B6F-3D74-AD52-31F1430DBC6F
Kernel slide: 0x000000001f400000
Kernel text base: 0xffffff801f600000
__HIB text base: 0xffffff801f500000
System model name: VMware7,1 (Mac-E43C1C25D4880AD6)
System uptime in nanoseconds: 135059598058
last loaded kext at 133429153651: cn.junkman.kext.xnu_printf_prec_panic 0000.00.01 (addr 0xffffff7fa1c27000, size 12288)
loaded kexts:
cn.junkman.kext.xnu_printf_prec_panic 0000.00.01
com.vmware.kext.vmhgfs 0806.83.93
com.vmware.kext.vmmemctl 0806.83.93
com.vmware.kext.VMwareGfx 0806.83.93
com.apple.driver.AudioAUUC 1.70
...
Analyze the kernel panic report
The heading Anonymous UUID
seems used to identify each unique kernel panic(need confirmation).
The kernel trapped at address 0xffffff801f78776f
, and trap type is 14(page fault), such fault usually indicate an invalid read/write of an unmapped page of memory, i.e. memory access violation.
Next, the registers and their values are given when kernel panicking.
Amongst those, the RIP
(program counter) register holds 0xffffff801f746340
, which is the faulting instruction.
The CR2
register contains the address the program attempted to access when a page fault occurs.
Fault CR2: 0xffffff8031990000, Error code: 0x0000000000000000, Fault CPU: 0x1 VMM, PL: 0, VF: 1
The kernel backtrace allows us to determine function call path, it’s essential for finding the root cause of the panic.
Backtrace (CPU 1), Frame : Return Address
0xffffff80b771b330 : 0xffffff801f66c1c6
0xffffff80b771b380 : 0xffffff801f795274
0xffffff80b771b3c0 : 0xffffff801f787544
0xffffff80b771b430 : 0xffffff801f61e1e0
0xffffff80b771b450 : 0xffffff801f66bc3c
0xffffff80b771b580 : 0xffffff801f66b9fc
0xffffff80b771b5e0 : 0xffffff801f78776f
0xffffff80b771b750 : 0xffffff801f61e1e0
0xffffff80b771b770 : 0xffffff801f746340
0xffffff80b771b880 : 0xffffff801fc40502
0xffffff80b771b8c0 : 0xffffff801fc3faac
0xffffff80b771bcd0 : 0xffffff801f680c61
0xffffff80b771bec0 : 0xffffff801f67fcf5
0xffffff80b771bf20 : 0xffffff7fa1c27d5b
0xffffff80b771bf60 : 0xffffff7fa1c27ac2
0xffffff80b771bfa0 : 0xffffff801f61d557
Kernel Extensions in backtrace:
cn.junkman.kext.xnu_printf_prec_panic(0.0.1)[F2B9A806-51AA-3ADE-903C-1A8DD67EE4A6]@0xffffff7fa1c27000->0xffffff7fa1c29fff
Walking thru the backtrace, the validation kext is inside, it loaded at 0xffffff7fa1c27000
, ends at 0xffffff7fa1c29fff
(inclusive), and its wired size is 0x3000
0x3000 = (0xffffff7fa1c29fff - 0xffffff7fa1c27000 + 1)
Bundle identifier: cn.junkman.kext.xnu_printf_prec_panic
Version: 0.0.1
Kext executable Mach-O UUID: F2B9A806-51AA-3ADE-903C-1A8DD67EE4A6
(from v0000.00.01 tag, debug build)
BSD process name corresponding to current thread: kernel_task
Boot args: -v
Mac OS version:
17G65
Kernel version:
Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64
Kernel UUID: 1AE5ACFD-3B6F-3D74-AD52-31F1430DBC6F
Kernel slide: 0x000000001f400000
Kernel text base: 0xffffff801f600000
__HIB text base: 0xffffff801f500000
System model name: VMware7,1 (Mac-E43C1C25D4880AD6)
The kernel image file located at /System/Library/Kernels/kernel
Kernel slide: 0x000000001f400000
is of importance as it contains the delta the kernel image was shifted in memory due to kASLR
Kernel text base: 0xffffff801f600000
is the beginning of the kernel TEXT code, if you use otool(1) to grep LC_SEGMENT_64
(64 for 64-bits kernel) of /System/Library/Kernels/kernel
, you could see a fixed virtual address of kernel TEXT code:
$ otool -l /System/Library/Kernels/kernel | less
Mach header
magic cputype cpusubtype caps filetype ncmds sizeofcmds flags
0xfeedfacf 16777223 3 0x00 2 23 4792 0x00200001
Load command 0
cmd LC_SEGMENT_64
cmdsize 392
segname __TEXT
vmaddr 0xffffff8000200000
vmsize 0x0000000000800000
fileoff 0
filesize 8388608
maxprot 0x00000007
initprot 0x00000005
nsects 4
flags 0x0
...
0xffffff801f600000 - 0x000000001f400000 = 0xffffff8000200000
The __HIB
is also kASLRed, it contains hibernation status(need confirmation).
Track down the kernel panic
Use given info above, we can now track down the kernel backtrace, to find the root cause of the panic.
We should start from the bottom frame of the backtrace, it’s is the deepest frame.
Use Hopper Disassembler to load /System/Library/Kernels/kernel
, due to the fact that kASLR slides the kernel in memory, we need to tell Hopper to rebase this image. Click Modify
then Change File Base Address
, in such case, the kernel __HIB
TEXT base is 0xffffff801f500000
Once the kernel image is rebased, hit ‘G’ and enter the address at the bottom of the backtrace, i.e. 0xffffff801f61d557
:
As we can see in the following disassembly, this address maps to an instruction immediately following a call
instruction:
When the CPU encounters such a call
instruction, it saves the address of the next instruction on the stack. This allows it to know where to return, when the call
has completed. When the kernel is preparing the panic report, it generates the backtrace by walking the stack and finding these saved addresses.
Thus, when we encounter a backtrace address such as 0xffffff801f61d557
, the call
instruction that immediately proceeds this, is a function that was invoked leading to the faulting instruction.
So here for example, we know the call rcx
at 0xffffff801f61d555
was invoked by the kernel causes the panic.
Next kernel frame is 0xffffff7fa1c27ac2
, as mentioned before, it lies between [0xffffff7fa1c27000, 0xffffff7fa1c29fff]
, it’s a address of cn.junkman.kext.xnu_printf_prec_panic
Thus we need to use Hopper Disassembler to load xnu_printf_prec_panic.kext/Contents/MacOS/xnu_printf_prec_panic
, use sleights aforementioned to disassemble address 0xffffff7fa1c27ac2
:
Here again, the call _printf_prec_panic_test
at 0xffffff7fa1c27abd
causes the kernel panic.
Continually follow up frames in backtraces, we would finally get an annotated backtrace:
Framee #0: kernel call_continuation + 0x15
0xffffff801f61d555: call rcx
Framee #1: xnu_printf_prec_panic _thread_runloop + 0x1d
0xffffff7fa1c27abd: call _printf_prec_panic_test
Framee #2: xnu_printf_prec_panic _printf_prec_panic_test + 0x166
0xffffff7fa1c27d56: call 0xffffff7f43851028 ; _printf
Framee #3: kernel _printf + 0x50
0xffffff801f67fcf0: call sub_ffffff800027fd20
Framee #4: kernel sub_ffffff800027fd20 + 0xf3c
0xffffff801f680c5c: call _os_log_with_args
Framee #5: kernel sub_ffffff801fc3ef70 + 0xb37
0xffffff801fc3faa7: call sub_ffffff801fc40460
Framee #6: kernel sub_ffffff801fc40460 + 0x9d
0xffffff801fc404fd: call ___strlcpy_chk
Framee #7: kernel ___strlcpy_chk + 0x80
0xffffff801f746340: cmp byte [rax], 0x0 ; CODE XREF=___strlcpy_chk+30, ___strlcpy_chk+13
Framee #8: kernel _return_from_trap + 0xdb
0xffffff801f61e1db: call _kernel_trap
Framee #9: kernel _kernel_trap + 0x70a
0xffffff801f78776a: call _panic
Framee #10: kernel _panic + 0x57
0xffffff801f66b9f7: call _panic_trap_to_debugger
Framee #11: kernel _panic_trap_to_debugger + 0x21b
0xffffff801f66bc3b: int3
Framee #12: kernel _return_from_trap + 0xdb
0xffffff801f61e1db: call _kernel_trap
Framee #13: kernel _kernel_trap + 0x4df
0xffffff801f78753f: call _kdp_i386_trap
Framee #14: kernel _kdp_i386_trap + 0x114
0xffffff801f79526f: call _handle_debugger_trap
Framee #15: kernel _handle_debugger_trap + 0x4c1
0xffffff801f66c1c1: call _SavePanicInfo
You’ll notice there’re three unnamed procedures:
sub_ffffff800027fd20
sub_ffffff801fc3ef70
sub_ffffff801fc40460
Since the kernel is release build, thus the compiler may do heavy optimization jobs to speed up code execution, in short, those functions are inlined.
If you dive into XNU kernel source, and seek around, you’ll find sub_ffffff800027fd20
is inlined vprintf_internal
Use the same technique, we get:
sub_ffffff800027fd20 (inlined vprintf_internal)
sub_ffffff801fc3ef70 (inlined _os_log_to_log_internal)
sub_ffffff801fc40460 (inlined _encode_data)
From the kext’s perspective, it’s the call 0xffffff7f43851028
instruction at 0xffffff7fa1c27d56
leads up the kernel panic:
loc_ffffff7fa1c27d19:
ffffff7fa1c27d19 lea rdi, qword [aXnuprintfprecp_fc8] ; argument "format" for method _printf, "xnu_printf_prec_panic: [DBG] %llu: %.*s\\n", CODE XREF=_printf_prec_panic_test+220
ffffff7fa1c27d20 mov rax, qword [rbp+var_10]
ffffff7fa1c27d24 movsxd rcx, dword [_printf_prec_panic_test.fixed] ; _printf_prec_panic_test.fixed
ffffff7fa1c27d2b mov byte [rax+rcx], 0x3e
ffffff7fa1c27d2f mov rax, qword [rbp+var_10]
ffffff7fa1c27d33 mov edx, dword [_printf_prec_panic_test.fixed] ; _printf_prec_panic_test.fixed
ffffff7fa1c27d39 add edx, 0x1
ffffff7fa1c27d3c movsxd rcx, edx
ffffff7fa1c27d3f mov byte [rax+rcx], 0x3e
ffffff7fa1c27d43 mov rsi, qword [_printf_prec_panic_test.i] ; _printf_prec_panic_test.i
ffffff7fa1c27d4a mov edx, dword [rbp+var_4]
ffffff7fa1c27d4d sub edx, 0x1
ffffff7fa1c27d50 mov rcx, qword [rbp+var_10]
ffffff7fa1c27d54 mov al, 0x0
ffffff7fa1c27d56 call 0xffffff7f43851028 ; _printf
ffffff7fa1c27d5b mov dword [rbp+var_20], eax
printf
’s prototype is int printf(const char *fmt, ...);
, and its fmt
is "xnu_printf_prec_panic: [DBG] %llu: %.*s\n"
, so, there’re extra three arguments.
According to x86-64 calling convention, the registers RDI, RSI, RDX, RCX, R8, and R9 are used for integer and memory address arguments.
_printf_prec_panic_test.i
and _printf_prec_panic_test.fixed
stay in .bss section, they’re static variables, it indicates that the panic happened when formatting the fmt
string.
Actually, use kernel.development
can produce verbose backtrace since the kernel isn’t compiled with aggressive optimizations:
* thread #2, name = '0xffffff80338f1650', queue = '0x0', stop reason = signal SIGSTOP
* frame #0: 0xffffff801aef81b6 kernel.debug`strlcpy(dst="0x352be8a0", src="0x352be8a0>>ᆳ?"..., siz=11) at subrs.c:566
frame #1: 0xffffff801aef8659 kernel.debug`__strlcpy_chk(dst="0x352be8a0", src="0x352be8a0>>ᆳ?"..., s=11, chk_size=18446744073709551615) at subrs.c:698
frame #2: 0xffffff801b77a76f kernel.debug`_encode_data(content=0xffffff82f8d63bda, arg=0xffffff80352be8a0, arg_len=11, context=0xffffff82f8d63a20) at log_encode.h:64
frame #3: 0xffffff801b779b25 kernel.debug`_os_log_encode [inlined] _os_log_encode_arg(arg=0xffffff80352be8a0, arg_len=11, ctype='\x02', is_private=false, context=0xffffff82f8d63a20) at log_encode.h:208
frame #4: 0xffffff801b779831 kernel.debug`_os_log_encode(format="xnu_printf_prec_panic: [DBG] %llu: %.*s\n", args=0xffffff82f8d63aa0, saved_errno=0, context=0xffffff82f8d63a20) at log_encode.h:457
frame #5: 0xffffff801b776447 kernel.debug`_os_log_to_log_internal(oslog=0xffffff801bc8eb80, type='\0', format="xnu_printf_prec_panic: [DBG] %llu: %.*s\n", args=0xffffff82f8d63f40, addr=0xffffff7f9d742c7c, dso=0xffffff7f9d742000) at log.c:315
frame #6: 0xffffff801b772dcd kernel.debug`_os_log_with_args_internal(oslog=0xffffff801bc8eb80, type='\0', format="xnu_printf_prec_panic: [DBG] %llu: %.*s\n", args=0xffffff82f8d63f40, addr=0xffffff7f9d742c7c, dso=0x0000000000000000) at log.c:189
frame #7: 0xffffff801b772e3b kernel.debug`os_log_with_args(oslog=0xffffff801bc8eb80, type='\0', format="xnu_printf_prec_panic: [DBG] %llu: %.*s\n", args=0xffffff82f8d63f40, addr=0xffffff7f9d742c7c) at log.c:160
frame #8: 0xffffff801ad83115 kernel.debug`vprintf_internal(fmt="xnu_printf_prec_panic: [DBG] %llu: %.*s\n", ap_in=0xffffff82f8d63f40, caller=0xffffff7f9d742c7c) at printf.c:846
frame #9: 0xffffff801ad82f12 kernel.debug`printf(fmt="xnu_printf_prec_panic: [DBG] %llu: %.*s\n") at printf.c:859
frame #10: 0xffffff7f9d742c7c xnu_printf_prec_panic`thread_runloop [inlined] printf_prec_panic_test at xnu_printf_prec_panic.c:51 [opt]
frame #11: 0xffffff7f9d742bbe xnu_printf_prec_panic`thread_runloop(arg=<unavailable>, wres=<unavailable>) at xnu_printf_prec_panic.c:82 [opt]
frame #12: 0xffffff801af59307 kernel.debug`call_continuation + 23
Above verbose backtrace give us root cause of why the kernel panicked, when printf
format "%[.precision]s"
, in such case, it’s %.*s
.
Its call tree:
_os_log_encode()
{
...
case 's': // string
value.type.pch = va_arg(args, char *);
if (!prec && value.type.pch != NULL) {
prec = (int) strlen(value.type.pch) + 1;
}
#ifndef KERNEL
// capture the string pointer to generate a symptom
if (context->log && context->log->generate_symptoms && context->arg_idx == 0 && value.type.pch) {
context->symptom_str = value.type.pch;
}
#endif
context->buffer->flags |= OS_LOG_BUFFER_HAS_NON_SCALAR;
_os_log_encode_arg(value.type.pch, prec, OS_LOG_BUFFER_VALUE_TYPE_STRING, false, context);
prec = 0;
done = true;
break;
...
}
_os_log_encode_arg()
{
...
case OS_LOG_BUFFER_VALUE_TYPE_STRING:
context->buffer->flags |= OS_LOG_BUFFER_HAS_NON_SCALAR;
if (_os_log_string_is_public(arg)) {
content->flags &= ~OS_LOG_CONTENT_FLAG_PRIVATE;
}
_encode_data(content, arg, arg_len, context);
break;
...
}
_encode_data()
{
...
if (arginfo.length) {
if (content->type == OS_LOG_BUFFER_VALUE_TYPE_STRING
#ifndef KERNEL
|| content->type == OS_LOG_BUFFER_VALUE_TYPE_OBJECT
#endif
) {
strlcpy(databuf, arg, arginfo.length);
} else {
memcpy(databuf, arg, arginfo.length);
}
}
...
}
size_t strlcpy(char *dst, const char *src, size_t siz)
{
char *d = dst;
const char *s = src;
size_t n = siz;
/* Copy as many bytes as will fit */
if (n != 0 && --n != 0) {
do {
if ((*d++ = *s++) == 0)
break;
} while (--n != 0);
}
/* Not enough room in dst, add NUL and traverse rest of src */
if (n == 0) {
if (siz != 0)
*d = '\0'; /* NUL-terminate dst */
while (*s++) /* XXX: causes the kernel panic */
;
}
return(s - src - 1); /* count does not include NUL */
}
_encode_data
assume when content->type
is OS_LOG_BUFFER_VALUE_TYPE_STRING
, the given string arg
is NUL-terminated.
It calls down to strlcpy
, which requires its argument src
is NUL-terminated, yet it’s not the case, thus triggered memory access violation.
_os_log_encode
not deal such special case when user give a string format with a precision modifier.
Dive into xnu_printf_prec_panic
We haven’t talk thoroughly about xnu_printf_prec_panic kext, it’s a validation kext to reproduce the panic bug.
The core code listed below:
#define PANIC_QUICKLY 1
/**
* Do printf() precision print panic test
*/
static void printf_prec_panic_test(void)
{
static volatile uint64_t i = 0;
static int fixed = 10;
int sz = fixed + 2;
char *p = _MALLOC(sz, M_TEMP, M_NOWAIT);
int n;
i++;
if (p == NULL) {
LOG_ERR("_MALLOC() fail due to OOM i: %llu", i);
return;
}
n = snprintf(p, sz, "%#010x", (uint32_t) p);
kassertf(n == fixed, "n is %d", n);
kassertf(p[fixed] == '\0', "p[%d] is %#x", fixed, p[fixed]);
/* Manually clear the EOS to test printf() precision print */
p[fixed] = '>';
p[fixed+1] = '>';
LOG_DBG("%llu: %.*s", i, sz-1, p);
#if PANIC_QUICKLY
/**
* Won't free `p'
* This memleak used to increase kernel address entropy
* so the kernel can panic quickly
*/
#else
_FREE(p, M_TEMP);
#endif
}
Its behaviour just like aforementioned, if you define PANIC_QUICKLY
to 0
, i.e. without any memory leakage, the panic bug isn’t reproducible stably, since when you _FREE
an address, the address will be recycled(duh), in next _MALLOC
, you may have a chance to get recently recycled address, i.e. the addresses in last recent _FREE
calls, in such case, if previous addresses not cause kernel panic, it likely won’t cause panic as before.
Thus when we don’t free the memory(will causes memory leakage, it won’t lead to panic), we can increase free memory poll address entropy, thus the kernel can panic quickly.
Severity
This bug can be reproduce steadily in macOS which uses log(1) as its primary logging system.
The log(1) logging system first adopted in macOS Sierra(10.12), currently this bug affects macOS [10.12, 10.14.3(18D42)].
As comparison, this bug cannot be reproduce in macOS 10.10.5, 10.11.6, which it use syslog(1) for logging.
Dude, it’s not a feature, :-P
Fixup
In short, the workaround is, when you use printf
with %[.precision]s
format, make sure the C-string is NUL-terminated, or simply don’t use precision-decorated string format.
This bug seems also reproducible for IOLog
(XXX: need confirmation).
References
XNU kernel precision printf() C-string panic validation program
An Unpatched Kernel Bug, apple’s AMDRadeonX4150 kext triggered a kernel panic…why?