***************************************************************************** * Kernel debugging, part 2: "When things become hairy" * * * * Copyright (c) 2002 Daniel P. Bovet, Marco Cesati, and Cosimo Comella * * Permission is granted to copy, distribute and/or modify this document * * under the terms of the GNU Free Documentation License, Version 1.1, * * published by the Free Software Foundation; with no Invariant Sections, * * with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the * * license is included in the file named LICENSE. * * * * (version 1.1) * ***************************************************************************** In the previous lecture we discussed how to locate a kernel bug that freezes a process or produces a kernel oops. It turned out that both the "ps" command and the kernel oops itself may yield lots of information, provided that we have a way to relate kernel linear addresses and source symbol names. Such relationship is stored in the Symbol.map file produced at kernel compile time; when debugging modules, moreover, we also need to determine where each module is loaded in RAM; this can be easily done by analyzing the /proc/ksyms virtual file. In this lecture we'll continue our survey of debugging techniques by presenting a few more "live" examples. So, let's start with the first one. Example 4. The "mystery4" program freezes. The "mystery4" program doesn't respond to signals, and it cannot be stopped or interrupted by the control terminal. As usual, we can collect information about the process using the "ps" command: $ ps u -C mystery4 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND cesati 232 99.4 0.1 1204 324 pts/1 R 11:08 2:28 ./mystery4 The "R" status flag means that the process is running. Since it is not sleeping, it doesn't make sense to ask "ps" to tell us the wait channel address: $ ps -o pid,wchan -C mystery4 PID WCHAN 232 - The "ps" command doesn't tell us whether the process is running in Kernel Mode or in User Mode. However, we can deduce that the process is freezed in Kernel Mode from the fact that it is not killed even by SIGKILL signals. The kernel checks whether a process has pending signals right before returning in User Mode. Thus, the problem seems to be that "mystery4" keeps running in Kernel Mode: it looks like a kernel bug. Just to make sure, let's try the "strace" command to trace system calls issued by the process. We use the "-p" option, which allows to trace the system calls issued by an already existing process: $ strace -p 232 [no output] Nothing appears: this means that the process is not entering or exiting system calls, as we suspected. Till now, we did't succeed in finding meaningful information. Let's try with the "ps" command again: it can also display the value stored in the eip (the "instruction pointer") and esp (the "stack pointer") registers relative to the process: $ ps -o pid,eip,esp -C mystery4 PID EIP ESP 232 400e0004 bffff95c At last we found some addresses. However, they do not look very promising, because they fall into the User Mode linear address space (first three gigabytes, i.e., below 0xc0000000). Where they came from? In Linux, the "ps" command fetches information from the /proc virtual filesystem. In fact, for any process of the system the kernel creates a virtual subdirectory below /proc named as the process's PID: $ ls -l /proc/232/ total 0 -r--r--r-- 1 cesati users 0 Jan 15 11:47 cmdline lrwxrwxrwx 1 cesati users 0 Jan 15 11:47 cwd -> /home/cesati/LKHC -r-------- 1 cesati users 0 Jan 15 11:47 environ lrwxrwxrwx 1 cesati users 0 Jan 15 11:47 exe -> /home/cesati/LKHC/mystery4 dr-x------ 2 cesati users 0 Jan 15 11:47 fd -r--r--r-- 1 cesati users 0 Jan 15 11:47 maps -rw------- 1 cesati users 0 Jan 15 11:47 mem lrwxrwxrwx 1 cesati users 0 Jan 15 11:47 root -> / -r--r--r-- 1 cesati users 0 Jan 15 11:47 stat -r--r--r-- 1 cesati users 0 Jan 15 11:47 statm -r--r--r-- 1 cesati users 0 Jan 15 11:47 status The "stat" virtual file yields lots of data (the proc(5) manual page provides a description of any field returned by the "stat" file): $ cat /proc/232/stat 232 (mystery4) R 203 232 203 34817 232 0 14 0 82 0 0 283053 0 0 17 0 0 0 122207 1232896 81 4294967295 134512640 134514112 3221224048 3221223772 1074659332 262402 0 0 0 0 0 0 17 0 The "ps" command extracts the contents of the eip and esp registers from here. (Notice that decimal 1074659332 corresponds to hex 0x400e0004, and that decimal 3221223772 corresponds to hex 0xbffff95c.) In general, however, these values refer to the contents of the registers as saved in the Kernel Mode stack when the process switched from User Mode to Kernel Mode. They say nothing about what the process is currently doing in Kernel Mode. However, we can determine from the eip and esp registers returned by "ps" what kind of system call has been issued by the process before the freeze. The technique is somewhat similar to those shown in the previous lecture. We must establish a mapping between the User Mode linear addresses and the source code of the program. To do this, we can just read the "maps" virtual file of the /proc/232 directory: $ cat /proc/232/maps 08048000-08049000 r-xp 00000000 03:05 295144 /home/cesati/LKHC/mystery4 08049000-0804a000 rw-p 00000000 03:05 295144 /home/cesati/LKHC/mystery4 40000000-40015000 r-xp 00000000 03:05 749242 /lib/ld-2.2.3.so 40015000-40016000 rw-p 00014000 03:05 749242 /lib/ld-2.2.3.so 40016000-40017000 rw-p 00000000 00:00 0 40021000-40128000 r-xp 00000000 03:05 749247 /lib/libc-2.2.3.so 40128000-4012f000 rw-p 00106000 03:05 749247 /lib/libc-2.2.3.so 4012f000-40133000 rw-p 00000000 00:00 0 bfffe000-c0000000 rwxp fffff000 00:00 0 As you see, several regions of the User Mode linear address space map all sections of both the program and the shared libraries used by the program. Executable code sections are marked as "r-xp", while data sections are marked as "rw-p". The last region marked as "rwxp" maps the User Mode stack. It turns out that the 0x400e0004 falls inside the region mapping the executable code of the C library. As usual, we can compute the offset inside the region (0x400e0004-0x40021000=0xbf004=782340), and thus derive the offset inside the executable file /lib/libc-2.2.3.so (we should add the file offset where the mapping starts, 0 in this case). Now, we can disassemble the C library executable file: $ objdump -d --start-address=0xbeff0 /lib/libc.2.2.3.so 00000000000beff0 <__libc_read>: beff0: 53 push %ebx beff1: 8b 54 24 10 mov 0x10(%esp,1),%edx beff5: 8b 4c 24 0c mov 0xc(%esp,1),%ecx beff9: 8b 5c 24 08 mov 0x8(%esp,1),%ebx beffd: b8 03 00 00 00 mov $0x3,%eax bf002: cd 80 int $0x80 bf004: 5b pop %ebx bf005: 3d 01 f0 ff ff cmp $0xfffff001,%eax bf00a: 73 01 jae bf00d <__libc_read+0x1d> bf00c: c3 ret [snipped] The "int $0x80" instruction raises the system call whose number is contained in the eax register, 3 in our case. The include/asm-i386/unistd.h file lists all system call numbers: ====== excerpt from include/asm-i386/unistd.h ====== #define __NR_exit 1 #define __NR_fork 2 #define __NR_read 3 #define __NR_write 4 #define __NR_open ==================================================== Therefore, the process keeps executing the read() system call service routine. It is just a bit of information, but it is better than nothing. However, to effectively trace this kernel bug we need some help from the kernel itself. The Linux kernel doesn't include a self-debugger, as some other Unix kernels do. However, it can be compiled in such a way to make kernel debugging much easier. When defining the compilation options, you are asked whether you want kernel debugging support: $ make config [snipped] Kernel debugging (CONFIG_DEBUG_KERNEL) [Y/n/?] Debug high memory support (CONFIG_DEBUG_HIGHMEM) [N/y/?] Debug memory allocations (CONFIG_DEBUG_SLAB) [N/y/?] Memory mapped I/O debugging (CONFIG_DEBUG_IOVIRT) [N/y/?] Magic SysRq key (CONFIG_MAGIC_SYSRQ) [Y/n/?] Spinlock debugging (CONFIG_DEBUG_SPINLOCK) [N/y/?] Verbose BUG() reporting (adds 70K) (CONFIG_DEBUG_BUGVERBOSE) [N/y/?] You should always compile "Magic SysRq key" support when hacking the kernel. In my opinion, you should compile it even for production kernels, because the magic keys might save your system integrity in nasty situations. If you are worried about the security implications, you may temporarily disable the magic keys by issuing as root: # echo "0" > /proc/sys/kernel/sysrq (In earlier kernel versions magic keys were disabled by default, and before using them you had to enable them.) A kernel with support for the magic keys answers to specific combinations of key presses starting with (usually + on PCs). For any key press combination, the kernel performs a specific action, usually writing the result in a kernel message. For instance, the + combination produces a short list of allowed magic keys combination. If you don't see the message, try to dump the kernel messages buffer with the dmesg command: (+ pressed) $ dmesg [snipped] SysRq : HELP : loglevel0-8 reBoot tErm kIll saK killalL showMem Off showPc unRaw Sync showTasks Unmount Each capitalized letter represents a key press combination. For instance, + reboots the machine without syncing or unmounting the disks. Be careful! To our purposes, the + magic key is invaluable. It shows some living information on all processes in the system: (+) $ dmesg [snipped] SysRq : Show State free sibling task PC stack pid father child younger older init S C131FF2C 4868 1 0 200 4 (NOTLB) Call Trace: [] [] [] [] [] [snipped] bash S 7FFFFFFF 4 222 220 (NOTLB) Call Trace: [] [] [] [] [] mystery4 R current 2656 232 203 206 (NOTLB) Call Trace: [] [] [] [] [] [snipped] As you see, our process is marked as running "R" and in execution when the magic key was pressed. The "Call Trace" line shows the linear addresses stored in the Kernel Mode stack of the process. As usual, we must resolve them by using the Symbol.map file (for statically linked kernel functions) and the /proc/ksyms virtual file (for dynamically loaded modules: 0xc0113c55: function schedule() at offset 0x1c9 = 457 0xcd0530c9: module hack4 at offset 0x69 = 105 0xc01472ba: function proc_file_read() at offset 0xf2 = 242 0xc012fdb6: function sys_read() at offset 0x96 = 150 0xc0106b0b: function system_call() at offset 0x33 = 51 Of course, pressing again the + magic key might produce a different output, depending on the code being executed when the keyboard interrupt occurred. In any case, the output gives us the information we were looking for. The code at offset 457 of schedule() is $ gdb linux-2.4.17-lkhc/vmlinux GNU gdb 5.0 Copyright 2000 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-slackware-linux"...(no debugging symbols found)... (gdb) disassemble schedule Dump of assembler code for function schedule: 0xc0113a8c : mov $0xffffe000,%eax 0xc0113a91 : and %esp,%eax 0xc0113a93 : push %ebp 0xc0113a94 : mov %esp,%ebp 0xc0113a96 : sub $0x18,%esp [snipped] 0xc0113c4e : mov %eax,0x34(%ebx) 0xc0113c51 : sti 0xc0113c52 : cmp %ebx,0xfffffff8(%ebp) 0xc0113c55 : jne 0xc0113c60 0xc0113c57 : mov 0xfffffff8(%ebp),%ebx 0xc0113c5a : jmp 0xc0113d46 [snipped] Ehm, nothing very useful here. It looks a regular invocation of the schedule() function. Let's try with the module code. $ objdump -d linux-2.4.17-lkhc/lkhc/hack4.o [snipped] 0000000000000048 : 48: 8b 54 24 0c mov 0xc(%esp,1),%edx 4c: a1 00 00 00 00 mov 0x0,%eax 51: 50 push %eax 52: 68 06 00 00 00 push $0x6 57: 03 54 24 0c add 0xc(%esp,1),%edx 5b: 52 push %edx 5c: e8 fc ff ff ff call 5d 61: 83 c4 0c add $0xc,%esp 64: e8 fc ff ff ff call 65 69: eb f9 jmp 64 6b: 90 nop 6c: 8d 74 26 00 lea 0x0(%esi,1),%esi This looks interesting. The instructions at offset 0x64 and 0x69 are an endless loop. To have a confirm, we checks the source file: $ cat linux-2.4.17-lkhc/lkhc/hack4.c [snipped] static int hack4_read(char *page, char **start, off_t off, int count, int *eof, void *data) { int written; written = sprintf(page+off, "jiffies=%ld ", jiffies); for(;;) schedule(); return written; } [snipped] As usual in our toy examples, the source of the problem is in the reading function of some special file. In this case, the function continously invokes schedule() without terminating. Of course, finding bugs in the real world is much more complicated, but the techniques shown here are always the basis of the bug hunting procedure. Example 5. Adding a new Magic Key. Although we discovered what the "mystery4" process is doing, we cannot destroy it without rebooting the system. That's quite boring. We are supposed to be kernel hacker, don't we? Thus, let us write a simple module that, once loaded, allows us to force a given "victim" process to generate a kernel oops and die. Let's call the new module "genoops". It should add new Magic Key 'd': when the user presses +, the kernel answers forcing an oops as soon as the victim process goes in execution. The PID of the victim process is passed as parameter when the module is loaded. From the previous lectures we already know the main steps required to write a new module, so I'll just quickly sketch them. STEP 1. We create entries for the new feature in arch/i386/config.in, and in the lkhc/Makefile (where our source code files are). ====== excerpt from arch/i386/config.in ====================================== if [ "$CONFIG_LKHC" != "n" ]; then tristate ' Linux Kernel Hacking Course hack #4' CONFIG_LKHC_HACK4 tristate ' Kernel OOPS generator' CONFIG_LKHC_GENOOPS fi ============================================================================== ====== excerpt from lkhc/Makefile ============================================ obj-$(CONFIG_LKHC_GENOOPS) += genoops.o ============================================================================== STEP 2. We create a skeleton source code lkhc/genoops.c file for the module: ===== excerpt from lkhc/genoops.c ============================================ #include #include #include #include static int __init genoops_init(void) { return 0; } static void __exit genoops_exit(void) { } EXPORT_NO_SYMBOLS; module_init(genoops_init); module_exit(genoops_exit); MODULE_AUTHOR("Marco Cesati"); MODULE_DESCRIPTION("LKHC Oops generator"); MODULE_LICENSE("GPL"); ============================================================================== This code should already produce a loadable module. STEP 3. We must register a new Magic Key. In order to do this, we include the include/linux/sysrq.h header file, then we allocate a new struct sysrq_key_op data structure: ========== excerpt from lkhc/genoops.c ======================================= ... #include static struct sysrq_key_op sysrq; static char magickey = 'd'; static char *help = "Dump"; static char *action = "Forcing oops (LKHC's genoops)"; =============================================================================== We also allocate a few static strings that store the our magic key, its name (as shown in +), and a message shown when the magic key is pressed. STEP 4. We must take care of the victim process PID. Let's define a "victim" variable that is also a parameter of the module: ========== excerpt from lkhc/genoops.c ======================================== ... static pid_t victim = 0; ... MODULE_AUTHOR("Marco Cesati"); MODULE_DESCRIPTION("LKHC Oops generator"); MODULE_LICENSE("GPL"); MODULE_PARM(victim, "i"); MODULE_PARM_DESC(victim, "PID of victim process"); =============================================================================== STEP 5. We initialize the struct sysrq_key_op data structure, and we register it in genoops_init(): ============ excerpt from lkhc/genoops.c ====================================== static int __init genoops_init(void) { int ret; if (!victim) { printk(KERN_INFO "genoops module: parameter \"victim\" required\n"); return 1; } sysrq.help_msg = help; sysrq.action_msg = action; sysrq.handler = handle_genoops; ret = register_sysrq_key(magickey, &sysrq); if (ret) printk(KERN_INFO "genoops module: loading failed (%d)\n", ret); else printk(KERN_INFO "genoops module: loading ok (victim PID=%d)\n", victim); return ret; } =============================================================================== Note that the function checks whether the user has specified a valid PID for the victim process. (Returning 1 aborts the module loading operation.) The handle_genoops() function is invoked whenever a user presses the new magic key. We'll describe it later. STEP 6. Now we must take care of module unloading, so let's customize the genoops_exit() function: ========== excerpt from lkhc/genoops.c ======================================== static void __exit genoops_exit(void) { if (unregister_sysrq_key(magickey, &sysrq)) printk(KERN_INFO "genoops module: error while unloading!\n"); else printk(KERN_INFO "genoops module: unloading ok\n"); } =============================================================================== Actual unregistering is done through the unregister_sysrq_key() function, passing to it both the magic key and the sysrq_key_op data structure. (Yes, kernel hackers are quite paranoid. The function refuses to unregister the magic key if the data structure doesn't correspond to it.) STEP 7. How can we kill the victim process? The most simple method consists of forcing the victim to execute a faulty code, for instance a code that dereferences a NULL pointer. Let us define a minimal function that does this: ========== excerpt from lkhc/genoops.c ======================================== static void __oops_code(void) { *(unsigned long *)0 = 0; } =============================================================================== The "NULL pointer dereference" fault will also cause the kernel to generate a useful oops message. STEP 8. Now we should write our magic key handler. Return type and parameters of the handler are defined by the kernel API as follows: ============ excerpt from lkhc/genoops.c ====================================== static void handle_genoops(int key, struct pt_regs *pt_regs, struct kbd_struct *kbd, struct tty_struct *tty) { } =============================================================================== We don't care too much about the parameters right now. They represent the ASCII code of the magic key, and pointers to several data structures related to the process that is in execution when the user presses the magic key. STEP 8.a The first task of the handler consists of deriving from the PID of the victim the address of the corresponding process descriptor: ========== excerpt from lkhc/genoops.c ======================================== struct task_struct *ts; ts = find_task_by_pid(victim); if (!ts) { printk(KERN_INFO "genoops module: victim (PID %d) not found\n", victim); return; } =============================================================================== The find_task_by_pid() function searches in a hash table the address of the descriptor whose PID is passed as parameter. If the search fails, the victim process doesn't exist, and the module issues a warning message and returns. STEP 8.b. Using our module can be quite dangerous for the system safety. At least, our module should refuse to kill the kernel threads. Any kernel thread is recognizable by not having a custom memory descriptor: ========== excerpt from lkhc/genoops.c ======================================== if (!ts->mm) { printk(KERN_INFO "genoops module: the victim is a kernel thread\n"); return; } =============================================================================== STEP 8.c. Ok, the victim must die. How can we force it to execute the __oops_code() function? That's simple if the victim is *not* the process currently in execution: ========== excerpt from lkhc/genoops.c ======================================== if (ts != current) { ts->thread.eip = __oops_code; return; } =============================================================================== We must defer the execution of __oops_code(), otherwise we would kill an innocent process. In particular, we must wait until the scheduler selects again the victim process for execution. When a process is not in execution the address of the code that it was last executing is saved in the thread.eip field of the process descriptor. Thus, the handler might limit itself to rewrite such field with the address of our __oops_code() function. When the victim is selected by the scheduler, the kernel forces the value in ts->thread.eip into the instruction pointer register, thus a jump is made to __oops_code(). STEP 8.d If the victim is the process currently in execution, we cannot directly invoke the __oops_code() function. In fact, the magic key handler is invoked by the interrupt handler of the keyboard, and generating an oops right now will likely crash the whole system. There are two cases, depending on whether the process was interrupted in Kernel Mode or in User Mode. ========== excerpt from lkhc/genoops.c ======================================== if ((regs->eflags & VM_MASK) || (3 & regs->xcs)) { force_sig(SIGABRT, ts); return; } =============================================================================== In this case the victim was interrupted in User Mode. It doesn't make sense to generate an oops. Rather, we kill the process by sending a SIGABRT signal (this signal might also generate a core dump of the User Mode execution state of the victim). STEP 8.e The last case occurs when the victim was interrupted in Kernel Mode. The handler modifies the return address of the keyboard interrupt handler, which is saved on the Kernel Mode stack and pointed to by the regs->eip field. ========== excerpt from lkhc/genoops.c ======================================== regs->eip = __oops_code; =============================================================================== Example 6. Killing the "mystery4" process with the genoops module. Now it's time to compile and try our new module. We shall use it to kill the boring "mystery4" process (PID 232): # insmod lkhc/genoops.o victim=232 (+ pressed) # dmesg [snipped] SysRq : Forcing oops (LKHC's genoops) Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: cd053060 *pde = 00000000 Oops: 0002 CPU: 0 EIP: 0010:[] Not tainted EFLAGS: 00013246 eax: 00000001 ebx: c32f2000 ecx: 00000000 edx: 00000019 esi: 00000000 edi: 00000019 ebp: c32f3f58 esp: c32f3f34 ds: 0018 es: 0018 ss: 0018 Process mystery4 (pid: 784, stackpage=c32f3000) Stack: c96fbde0 00000080 c52c6000 c32f2000 c0220000 c8689d20 00000000 c32f2000 c025b540 00000080 cd0550c9 c01472ba c52c6000 c32f3f98 00000000 00000080 c32f3f94 00000000 c96fbde0 ffffffea 00000000 00000080 c06bfce0 00000000 Call Trace: [] [] [] [] Code: c7 05 00 00 00 00 00 00 00 00 c3 90 53 8b 5c 24 0c 8b 0d 0c Notice that we obtained a call trace that is almost identical to the call trace produced in first example of this lecture. However, the keyboard interrupt occurred in the hack4 module code, rather than in the schedule() function. CAVEATS: For the sake of simplicity, this module works only on uniprocessor systems. It may easily crash your SMP system. And actually, it is not safe on UP systems either, because the magic key handler doesn't check for nested interrupt occurrences. You have been warned!