*****************************************************************************
*      Kernel debugging, part 2:  "When things become hairy"                *
*                                                                           *
* Copyright (c)  2002  Daniel P. Bovet, Marco Cesati, and Cosimo Comella    *
*    Permission is granted to copy, distribute and/or modify this document  *
*    under the terms of the GNU Free Documentation License, Version 1.1,    *
*    published by the Free Software Foundation; with no Invariant Sections, *
*    with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the *
*    license is included in the file named LICENSE.                         *
*                                                                           *
* (version 1.1)                                                             *
*****************************************************************************

In the previous lecture we discussed how to locate a kernel bug that
freezes a process or produces a kernel oops. It turned out that both the "ps"
command and the kernel oops itself may yield lots of information, provided
that we have a way to relate kernel linear addresses and source symbol names.
Such relationship is stored in the Symbol.map file produced at kernel compile
time; when debugging modules, moreover, we also need to determine where each
module is loaded in RAM; this can be easily done by analyzing the /proc/ksyms
virtual file.

In this lecture we'll continue our survey of debugging techniques by
presenting a few more "live" examples. So, let's start with the first one.

Example 4. The "mystery4" program freezes.

The "mystery4" program doesn't respond to signals, and it cannot be stopped
or interrupted by the control terminal. As usual, we can collect information
about the process using the "ps" command:

$ ps u -C mystery4
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
cesati     232 99.4  0.1  1204  324 pts/1    R    11:08   2:28 ./mystery4

The "R" status flag means that the process is running. Since it is
not sleeping, it doesn't make sense to ask "ps" to tell us the wait channel
address:

$ ps -o pid,wchan -C mystery4
  PID WCHAN
  232 -

The "ps" command doesn't tell us whether the process is running in Kernel 
Mode or in User Mode. However, we can deduce that the process is freezed in
Kernel Mode from the fact that it is not killed even by SIGKILL signals. The
kernel checks whether a process has pending signals right before returning in
User Mode. Thus, the problem seems to be that "mystery4" keeps running in
Kernel Mode: it looks like a kernel bug.

Just to make sure, let's try the "strace" command to trace system calls
issued by the process. We use the "-p" option, which allows to trace the
system calls issued by an already existing process:

$ strace -p 232
[no output]

Nothing appears: this means that the process is not entering or exiting
system calls, as we suspected.

Till now, we did't succeed in finding meaningful information. Let's try with
the "ps" command again: it can also display the value stored in the eip (the
"instruction pointer") and esp (the "stack pointer") registers relative to the
process:

$ ps -o pid,eip,esp -C mystery4
  PID      EIP      ESP
  232 400e0004 bffff95c

At last we found some addresses. However, they do not look very promising,
because they fall into the User Mode linear address space (first three
gigabytes, i.e., below 0xc0000000). Where they came from?

In Linux, the "ps" command fetches information from the /proc virtual
filesystem. In fact, for any process of the system the kernel creates a
virtual subdirectory below /proc named as the process's PID:

$ ls -l /proc/232/
total 0
-r--r--r--  1 cesati  users  0 Jan 15 11:47 cmdline
lrwxrwxrwx  1 cesati  users  0 Jan 15 11:47 cwd -> /home/cesati/LKHC
-r--------  1 cesati  users  0 Jan 15 11:47 environ
lrwxrwxrwx  1 cesati  users  0 Jan 15 11:47 exe -> /home/cesati/LKHC/mystery4
dr-x------  2 cesati  users  0 Jan 15 11:47 fd
-r--r--r--  1 cesati  users  0 Jan 15 11:47 maps
-rw-------  1 cesati  users  0 Jan 15 11:47 mem
lrwxrwxrwx  1 cesati  users  0 Jan 15 11:47 root -> /
-r--r--r--  1 cesati  users  0 Jan 15 11:47 stat
-r--r--r--  1 cesati  users  0 Jan 15 11:47 statm
-r--r--r--  1 cesati  users  0 Jan 15 11:47 status

The "stat" virtual file yields lots of data (the proc(5) manual page provides
a description of any field returned by the "stat" file):

$ cat /proc/232/stat
232 (mystery4) R 203 232 203 34817 232 0 14 0 82 0 0 283053 0 0 17 0 0 0 
122207 1232896 81 4294967295 134512640 134514112 3221224048 3221223772 
1074659332 262402 0 0 0 0 0 0 17 0

The "ps" command extracts the contents of the eip and esp registers from
here. (Notice that decimal 1074659332 corresponds to hex 0x400e0004, and that
decimal 3221223772 corresponds to hex 0xbffff95c.) In general, however, these
values refer to the contents of the registers as saved in the Kernel Mode
stack when the process switched from User Mode to Kernel Mode. They say
nothing about what the process is currently doing in Kernel Mode.

However, we can determine from the eip and esp registers returned by "ps" 
what kind of system call has been issued by the process before the freeze.
The technique is somewhat similar to those shown in the previous lecture.
We must establish a mapping between the User Mode linear addresses and the
source code of the program. To do this, we can just read the "maps" virtual
file of the /proc/232 directory:

$ cat /proc/232/maps
08048000-08049000 r-xp 00000000 03:05 295144     /home/cesati/LKHC/mystery4
08049000-0804a000 rw-p 00000000 03:05 295144     /home/cesati/LKHC/mystery4
40000000-40015000 r-xp 00000000 03:05 749242     /lib/ld-2.2.3.so
40015000-40016000 rw-p 00014000 03:05 749242     /lib/ld-2.2.3.so
40016000-40017000 rw-p 00000000 00:00 0
40021000-40128000 r-xp 00000000 03:05 749247     /lib/libc-2.2.3.so
40128000-4012f000 rw-p 00106000 03:05 749247     /lib/libc-2.2.3.so
4012f000-40133000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0

As you see, several regions of the User Mode linear address space map all
sections of both the program and the shared libraries used by the program.
Executable code sections are marked as "r-xp", while data sections are
marked as "rw-p". The last region marked as "rwxp" maps the User Mode stack.

It turns out that the 0x400e0004 falls inside the region mapping the
executable code of the C library. As usual, we can compute the offset inside
the region (0x400e0004-0x40021000=0xbf004=782340), and thus derive the
offset inside the executable file /lib/libc-2.2.3.so (we should add the 
file offset where the mapping starts, 0 in this case).

Now, we can disassemble the C library executable file:

$ objdump -d --start-address=0xbeff0 /lib/libc.2.2.3.so
00000000000beff0 <__libc_read>:
   beff0:       53                      push   %ebx
   beff1:       8b 54 24 10             mov    0x10(%esp,1),%edx
   beff5:       8b 4c 24 0c             mov    0xc(%esp,1),%ecx
   beff9:       8b 5c 24 08             mov    0x8(%esp,1),%ebx
   beffd:       b8 03 00 00 00          mov    $0x3,%eax
   bf002:       cd 80                   int    $0x80
   bf004:       5b                      pop    %ebx
   bf005:       3d 01 f0 ff ff          cmp    $0xfffff001,%eax
   bf00a:       73 01                   jae    bf00d <__libc_read+0x1d>
   bf00c:       c3                      ret
[snipped]

The "int $0x80" instruction raises the system call whose number is contained
in the eax register, 3 in our case. The include/asm-i386/unistd.h file lists
all system call numbers:

====== excerpt from include/asm-i386/unistd.h ======
#define __NR_exit		  1
#define __NR_fork		  2
#define __NR_read		  3
#define __NR_write		  4
#define __NR_open		  
====================================================

Therefore, the process keeps executing the read() system call service
routine. It is just a bit of information, but it is better than nothing.
However, to effectively trace this kernel bug we need some help from the
kernel itself.

The Linux kernel doesn't include a self-debugger, as some other Unix
kernels do. However, it can be compiled in such a way to make kernel
debugging much easier. When defining the compilation options, you are
asked whether you want kernel debugging support:

$ make config
[snipped]
Kernel debugging (CONFIG_DEBUG_KERNEL) [Y/n/?]
  Debug high memory support (CONFIG_DEBUG_HIGHMEM) [N/y/?]
  Debug memory allocations (CONFIG_DEBUG_SLAB) [N/y/?]
  Memory mapped I/O debugging (CONFIG_DEBUG_IOVIRT) [N/y/?]
  Magic SysRq key (CONFIG_MAGIC_SYSRQ) [Y/n/?]
  Spinlock debugging (CONFIG_DEBUG_SPINLOCK) [N/y/?]
  Verbose BUG() reporting (adds 70K) (CONFIG_DEBUG_BUGVERBOSE) [N/y/?]

You should always compile "Magic SysRq key" support when hacking the
kernel. In my opinion, you should compile it even for production kernels,
because the magic keys might save your system integrity in nasty situations.
If you are worried about the security implications, you may temporarily
disable the magic keys by issuing as root:

# echo "0" > /proc/sys/kernel/sysrq

(In earlier kernel versions magic keys were disabled by default, and
before using them you had to enable them.)

A kernel with support for the magic keys answers to specific combinations of
key presses starting with <SysRq> (usually <Alt>+<PrintScreen> on PCs). 
For any key press combination, the kernel performs a specific action, usually
writing the result in a kernel message.

For instance, the <SysRq>+<h> combination produces a short list of allowed
magic keys combination. If you don't see the message, try to dump the
kernel messages buffer with the dmesg command:

(<SysRq>+<h> pressed)
$ dmesg
[snipped]
SysRq : HELP : loglevel0-8 reBoot tErm kIll saK killalL showMem Off showPc
unRaw Sync showTasks Unmount 
               
Each capitalized letter represents a key press combination. For instance,
<SysRq>+<B> reboots the machine without syncing or unmounting the disks. Be
careful!

To our purposes, the <SysRq>+<T> magic key is invaluable. It shows some living
information on all processes in the system:

(<SysRq>+<T>)
$ dmesg
[snipped]
SysRq : Show State

                         free                        sibling
  task             PC    stack   pid father child younger older
init          S C131FF2C  4868     1      0   200       4       (NOTLB)
Call Trace: [<c0113a5b>] [<c01139a0>] [<c013bdc1>] [<c013c160>] [<c0106b0b>]
[snipped]
bash          S 7FFFFFFF     4   222    220                     (NOTLB)
Call Trace: [<c01139ff>] [<c01688dd>] [<c0164a88>] [<c012fdb6>] [<c0106b0b>]
mystery4      R current   2656   232    203                 206 (NOTLB)
Call Trace: [<c0113c55>] [<cd0530c9>] [<c01472ba>] [<c012fdb6>] [<c0106b0b>]
[snipped]

As you see, our process is marked as running "R" and in execution when the 
magic key was pressed. The "Call Trace" line shows the linear addresses stored
in the Kernel Mode stack of the process. As usual, we must resolve them by
using the Symbol.map file (for statically linked kernel functions) and the
/proc/ksyms virtual file (for dynamically loaded modules:

0xc0113c55:  function schedule()       at offset 0x1c9 = 457
0xcd0530c9:  module   hack4            at offset 0x69  = 105
0xc01472ba:  function proc_file_read() at offset 0xf2  = 242
0xc012fdb6:  function sys_read()       at offset 0x96  = 150
0xc0106b0b:  function system_call()    at offset 0x33  = 51

Of course, pressing again the <SysRq>+<T> magic key might produce a different
output, depending on the code being executed when the keyboard interrupt
occurred. In any case, the output gives us the information we were looking
for.

The code at offset 457 of schedule() is 

$ gdb linux-2.4.17-lkhc/vmlinux
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-slackware-linux"...(no debugging symbols found)...
(gdb) disassemble schedule
Dump of assembler code for function schedule:
0xc0113a8c <schedule>:  mov    $0xffffe000,%eax
0xc0113a91 <schedule+5>:        and    %esp,%eax
0xc0113a93 <schedule+7>:        push   %ebp
0xc0113a94 <schedule+8>:        mov    %esp,%ebp
0xc0113a96 <schedule+10>:       sub    $0x18,%esp
[snipped]
0xc0113c4e <schedule+450>:      mov    %eax,0x34(%ebx)
0xc0113c51 <schedule+453>:      sti
0xc0113c52 <schedule+454>:      cmp    %ebx,0xfffffff8(%ebp)
0xc0113c55 <schedule+457>:      jne    0xc0113c60 <schedule+468>
0xc0113c57 <schedule+459>:      mov    0xfffffff8(%ebp),%ebx
0xc0113c5a <schedule+462>:      jmp    0xc0113d46 <schedule+698>
[snipped]

Ehm, nothing very useful here. It looks a regular invocation of the schedule()
function. Let's try with the module code.

$ objdump -d linux-2.4.17-lkhc/lkhc/hack4.o
[snipped]
0000000000000048 <hack4_read>:
  48:   8b 54 24 0c             mov    0xc(%esp,1),%edx
  4c:   a1 00 00 00 00          mov    0x0,%eax
  51:   50                      push   %eax
  52:   68 06 00 00 00          push   $0x6
  57:   03 54 24 0c             add    0xc(%esp,1),%edx
  5b:   52                      push   %edx
  5c:   e8 fc ff ff ff          call   5d <hack4_read+0x15>
  61:   83 c4 0c                add    $0xc,%esp
  64:   e8 fc ff ff ff          call   65 <hack4_read+0x1d>
  69:   eb f9                   jmp    64 <hack4_read+0x1c>
  6b:   90                      nop
  6c:   8d 74 26 00             lea    0x0(%esi,1),%esi
 
This looks interesting. The instructions at offset 0x64 and 0x69 are an
endless loop. To have a confirm, we checks the source file:

$ cat linux-2.4.17-lkhc/lkhc/hack4.c
[snipped]
static int hack4_read(char *page, char **start, off_t off, int count, int *eof, void *data)
{
       int written;
 
       written = sprintf(page+off, "jiffies=%ld ", jiffies);
       for(;;) schedule();
       return written;
}
[snipped]

As usual in our toy examples, the source of the problem is in the reading
function of some special file. In this case, the function continously invokes
schedule() without terminating. Of course, finding bugs in the real world is
much more complicated, but the techniques shown here are always the basis of
the bug hunting procedure.


Example 5. Adding a new Magic Key.

Although we discovered what the "mystery4" process is doing, we cannot destroy
it without rebooting the system. That's quite boring. We are supposed to be
kernel hacker, don't we? Thus, let us write a simple module that, once loaded,
allows us to force a given "victim" process to generate a kernel oops and
die.

Let's call the new module "genoops". It should add new Magic Key 'd': when the
user presses <SysRq>+<d>, the kernel answers forcing an oops as soon as the
victim process goes in execution. The PID of the victim process is passed as
parameter when the module is loaded.

From the previous lectures we already know the main steps required to write a
new module, so I'll just quickly sketch them.


STEP 1. 

We create entries for the new feature in arch/i386/config.in, and in the
lkhc/Makefile (where our source code files are).

====== excerpt from arch/i386/config.in ======================================
   if [ "$CONFIG_LKHC" != "n" ]; then
      tristate '      Linux Kernel Hacking Course hack #4' CONFIG_LKHC_HACK4
      tristate '      Kernel OOPS generator' CONFIG_LKHC_GENOOPS
   fi
==============================================================================

====== excerpt from lkhc/Makefile ============================================
obj-$(CONFIG_LKHC_GENOOPS) += genoops.o   
==============================================================================

STEP 2.
    
We create a skeleton source code lkhc/genoops.c file for the module:

===== excerpt from lkhc/genoops.c ============================================
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static int __init genoops_init(void)
{
    return 0;
}

static void __exit genoops_exit(void)
{

}

EXPORT_NO_SYMBOLS;
module_init(genoops_init);
module_exit(genoops_exit);
 
MODULE_AUTHOR("Marco Cesati");
MODULE_DESCRIPTION("LKHC Oops generator");
MODULE_LICENSE("GPL");
==============================================================================

This code should already produce a loadable module.

STEP 3.

We must register a new Magic Key. In order to do this, we include the
include/linux/sysrq.h header file, then we allocate a new struct sysrq_key_op
data structure:

========== excerpt from lkhc/genoops.c =======================================
...

#include <linux/sysrq.h>

static struct sysrq_key_op sysrq;
static char magickey = 'd';
static char *help = "Dump";
static char *action = "Forcing oops (LKHC's genoops)";
===============================================================================

We also allocate a few static strings that store the our magic key, its name
(as shown in <SysRq>+<H>), and a message shown when the magic key is pressed.

STEP 4. 

We must take care of the victim process PID.  Let's define a "victim" variable
that is also a parameter of the module:

========== excerpt from lkhc/genoops.c ========================================

...
static pid_t victim = 0;
...

MODULE_AUTHOR("Marco Cesati");
MODULE_DESCRIPTION("LKHC Oops generator");
MODULE_LICENSE("GPL");
MODULE_PARM(victim, "i");
MODULE_PARM_DESC(victim, "PID of victim process");
===============================================================================

STEP 5.
 
We initialize the struct sysrq_key_op data structure, and we register
it in genoops_init():

============ excerpt from lkhc/genoops.c ======================================
static int __init genoops_init(void)
{
    int ret;
    
    if (!victim) {
        printk(KERN_INFO "genoops module: parameter \"victim\" required\n");
        return 1;
    }
    
    sysrq.help_msg = help;
    sysrq.action_msg = action;
    sysrq.handler = handle_genoops;
    ret = register_sysrq_key(magickey, &sysrq);
    if (ret)
        printk(KERN_INFO "genoops module: loading failed (%d)\n", ret);
    else
        printk(KERN_INFO "genoops module: loading ok (victim PID=%d)\n",
            victim);
    return ret;
}
===============================================================================

Note that the function checks whether the user has specified a valid PID for
the victim process. (Returning 1 aborts the module loading operation.)

The handle_genoops() function is invoked whenever a user presses the new magic
key. We'll describe it later.

STEP 6. 

Now we must take care of module unloading, so let's customize the
genoops_exit() function:

========== excerpt from lkhc/genoops.c ========================================
static void __exit genoops_exit(void)  
{
    if (unregister_sysrq_key(magickey, &sysrq))
        printk(KERN_INFO "genoops module: error while unloading!\n");
    else
        printk(KERN_INFO "genoops module: unloading ok\n");
}
===============================================================================

Actual unregistering is done through the unregister_sysrq_key() function,
passing to it both the magic key and the sysrq_key_op data structure. (Yes,
kernel hackers are quite paranoid. The function refuses to unregister the
magic key if the data structure doesn't correspond to it.)

STEP 7. 

How can we kill the victim process? The most simple method consists
of forcing the victim to execute a faulty code, for instance a code that
dereferences a NULL pointer. Let us define a minimal function that does this:

========== excerpt from lkhc/genoops.c ========================================
    
static void __oops_code(void) { *(unsigned long *)0 = 0; }

===============================================================================

The "NULL pointer dereference" fault will also cause the kernel to generate
a useful oops message. 

STEP 8.

Now we should write our magic key handler. Return type and parameters of
the handler are defined by the kernel API as follows:

============ excerpt from lkhc/genoops.c ======================================
static void handle_genoops(int key, struct pt_regs *pt_regs,
                struct kbd_struct *kbd, struct tty_struct *tty)
{

}
===============================================================================

We don't care too much about the parameters right now.  They represent
the ASCII code of the magic key, and pointers to several data structures
related to the process that is in execution when the user presses the
magic key.  

STEP 8.a 

The first task of the handler consists of deriving from the PID of the victim
the address of the corresponding process descriptor:


========== excerpt from lkhc/genoops.c ========================================

    struct task_struct *ts;
    ts = find_task_by_pid(victim);
    if (!ts) {
        printk(KERN_INFO "genoops module: victim (PID %d) not found\n",
            victim);
        return;
    }

===============================================================================

The find_task_by_pid() function searches in a hash table the address of the
descriptor whose PID is passed as parameter. If the search fails, the victim
process doesn't exist, and the module issues a warning message and returns.


STEP 8.b.

Using our module can be quite dangerous for the system safety. At least,
our module should refuse to kill the kernel threads. Any kernel thread is
recognizable by not having a custom memory descriptor:

========== excerpt from lkhc/genoops.c ========================================
    
    if (!ts->mm) {
        printk(KERN_INFO "genoops module: the victim is a kernel thread\n");
        return;
    }
    
===============================================================================

STEP 8.c.

Ok, the victim must die. How can we force it to execute the __oops_code()
function? That's simple if the victim is *not* the process currently in
execution:

========== excerpt from lkhc/genoops.c ========================================
    
    if (ts != current) {
        ts->thread.eip = __oops_code;
        return;
    }
===============================================================================
 
We must defer the execution of __oops_code(), otherwise we would kill an
innocent process. In particular, we must wait until the scheduler selects again
the victim process for execution.

When a process is not in execution the address of the code that it was last
executing is saved in the thread.eip field of the process descriptor. Thus, the
handler might limit itself to rewrite such field with the address of our
__oops_code() function. When the victim is selected by the scheduler, the
kernel forces the value in ts->thread.eip into the instruction pointer
register, thus a jump is made to __oops_code().


STEP 8.d

If the victim is the process currently in execution, we cannot directly invoke
the __oops_code() function. In fact, the magic key handler is invoked
by the interrupt handler of the keyboard, and generating an oops right now
will likely crash the whole system.

There are two cases, depending on whether the process was interrupted in Kernel
Mode or in User Mode.

========== excerpt from lkhc/genoops.c ========================================
    
    if ((regs->eflags & VM_MASK) || (3 & regs->xcs)) {
        force_sig(SIGABRT, ts);
        return;
    }
            
===============================================================================

In this case the victim was interrupted in User Mode. It doesn't make
sense to generate an oops. Rather, we kill the process by sending a
SIGABRT signal (this signal might also generate a core dump of the User
Mode execution state of the victim).


STEP 8.e

The last case occurs when the victim was interrupted in Kernel Mode.
The handler modifies the return address of the keyboard interrupt handler,
which is saved on the Kernel Mode stack and pointed to by the regs->eip 
field.

========== excerpt from lkhc/genoops.c ========================================

    regs->eip = __oops_code;
            
===============================================================================


Example 6. Killing the "mystery4" process with the genoops module.

Now it's time to compile and try our new module. We shall use it to kill
the boring "mystery4" process (PID 232):

# insmod lkhc/genoops.o victim=232

(<SysRq>+<D> pressed)

# dmesg
[snipped]
SysRq : Forcing oops (LKHC's genoops)
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
cd053060
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<cd053060>]    Not tainted
EFLAGS: 00013246
eax: 00000001   ebx: c32f2000   ecx: 00000000   edx: 00000019
esi: 00000000   edi: 00000019   ebp: c32f3f58   esp: c32f3f34
ds: 0018   es: 0018   ss: 0018
Process mystery4 (pid: 784, stackpage=c32f3000)
Stack: c96fbde0 00000080 c52c6000 c32f2000 c0220000 c8689d20 00000000 c32f2000
       c025b540 00000080 cd0550c9 c01472ba c52c6000 c32f3f98 00000000 00000080
       c32f3f94 00000000 c96fbde0 ffffffea 00000000 00000080 c06bfce0 00000000
Call Trace: [<cd0550c9>] [<c01472ba>] [<c012fdb6>] [<c0106b0b>]
 
Code: c7 05 00 00 00 00 00 00 00 00 c3 90 53 8b 5c 24 0c 8b 0d 0c

Notice that we obtained a call trace that is almost identical to the call trace
produced in first example of this lecture. However, the keyboard interrupt
occurred in the hack4 module code, rather than in the schedule() function.

CAVEATS: For the sake of simplicity, this module works only on uniprocessor 
systems. It may easily crash your SMP system. And actually, it is not safe on
UP systems either, because the magic key handler doesn't check for nested
interrupt occurrences. You have been warned!