Intercept Syscalls with Rust and Ptrace

How to use Rust with ptrace to intercept and modify syscalls?

Why?

Recently I had to check the correctness of a Java program that relies on msync for durability. This required simulating, from outside the JVM, that those calls fail. So the task is to find a way to make msync syscalls fail for certain mapped files.

Alternatives

Instead of writing this from scratch, one could use one of these alternatives:

strace allows for tampering with syscalls. However, its fault injection is not granular enough to easily have syscalls fail for a subset of files.
LD_PRELOAD could be used to replace the actual msync.
eBPF lets you trace syscalls, but it apparently only has read-only access to them. So without some serious stunts, we can't easily modify return values. We could use seccomp in conjunction with SECCOMP_RET_TRACE to have the tracer intercept only a subset of syscalls and improve performance, but that's out of scope.
reverie is an open source library for intercepting syscalls on Linux. It obviously does the job better than I will in this blog post.

Ptrace basics

ptrace allows your program, the tracer, to observe and control the execution of another program, the tracee. Once we attach a tracer to a tracee, the tracer can be notified of certain events happening in the tracee and can modify the tracee's state. For example, we could wait for the tracee to make a syscall, log that, and then do whatever we want with the invocation: block it, modify its arguments, delay it, inject an error even though the call went through, and so on. Injecting an error is what we will be doing. We will ignore most other functionality of ptrace for now.

Testee v1

Let's write a testee that calls msync in a loop; it will be our tracee. We will later switch to the JVM, but let's start with something more sane. In my last post, I explained in depth how msync works, so this part will be concise.

The following crates will be used:

nix for fallocate to quickly resize a test file to our desired size.
memmap2 for an easier mmap abstraction
log and env_logger for … logging. Since tracer and tracee will share stdout later, it’s quite nice to be able to tell them apart.
anyhow to have nicer ergonomics around errors

First, we print the pid so we can correlate that with the tracer and then open a file (and create it if it does not exist).

info!("Testee running with pid {}", process::id());

let file = OpenOptions::new()
    .write(true)
    .read(true)
    .create(true)
    .truncate(true)
    .open("file.bin")
    .context("Could not open file")?;

Then we zero out the file in the region we want to map. Now it can be mapped into memory.

fallocate(
    file.as_raw_fd(),
    FallocateFlags::FALLOC_FL_ZERO_RANGE,
    0,
    BUFFER_LEN,
)
.context("Could not resize file")?;

let mut mmap = unsafe {
    MmapOptions::new()
        .len(BUFFER_LEN as usize)
        .map_mut(&file)
        .context("Could not map file")?
};

Easy enough. Let’s write and flush it in a loop:

for _ in 0..10 {
    match flush_mmap(&mut mmap) {
        Ok(_) => info!("Testee done"),
        Err(e) => {
            error!("Mmap failed: {:?}", e)
        }
    }
    thread::sleep(Duration::from_millis(50));
}

Note that we only print the error and don’t handle it in any way.

The flush itself is trivial:

fn flush_mmap(mmap: &mut MmapMut) -> Result<()> {
    mmap[0] = 42;
    mmap.flush().with_context(|| "Could not flush file")?;
    Ok(())
}

Here, the flush() internally invokes msync.

After running, the file has the expected contents:

➜ hexyl file.bin
┌────────┬─────────────────────────┬────────┐
│00000000│ 2a 00 00 00 00 00 00 00 │*⋄⋄⋄⋄⋄⋄⋄│
│00000008│ 00 00 00 00 00 00 00 00 │⋄⋄⋄⋄⋄⋄⋄⋄│
│*       │                         │        │
│00a00000│                         │        │
└────────┴─────────────────────────┴────────┘

It’s time to trace this one.

Basic tracing

You can find the full source code on GitHub, so I will only walk you through the important parts.

Attaching to the tracee

The tracer needs to attach to the tracee somehow. While we could run the tracee first and then attach to it by pid, there is an easier way: We can fork and execute the tracee in the child process. That makes running the whole thing way more ergonomic.

In Rust, this can easily be done with the nix::unistd::fork() function.

match unsafe { fork() } {
    Ok(ForkResult::Child) => run_child(),        
    Ok(ForkResult::Parent { child }) => run_parent(child),
    Err(e) => panic!("Could not fork main process: {}", e),
};

Note that fork() is unsafe for reasons not relevant here.

The run_child will be simple: it requests to be traced and then simply executes the tracee process with exec. This does not change the PID, so the parent (that is passed the child variable) can use that pid to interact with the tracee.

fn run_child() {
    info!("havoc child process executing as {}", process::id());
    // the pid won't change with exec, so we ask to be traced
    ptrace::traceme().expect("OS could not be bothered to trace me");
    let e = Command::new("./target/release/testee").exec();
    unreachable!("Exec failed, this process should be dead: {e}")
}

Here, we can finally make use of the unreachable! macro: this process should be replaced with the testee. Emphasis on replaced: after a successful exec, the remaining instructions are overwritten and can't be executed. However, I do log the error for completeness.
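To make the "exec only returns on failure" point concrete, here is a minimal sketch using only the standard library. The helper name exec_or_error is made up for this illustration and is not part of havoc.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Try to replace the current process image with `program`.
///
/// On success this never returns, because the code that would receive the
/// return value no longer exists. The only thing it can hand back is the
/// error from a failed exec. (Hypothetical helper, for illustration only.)
fn exec_or_error(program: &str) -> io::Error {
    Command::new(program).exec()
}
```

Calling it with a path that does not exist should hand back an error of kind NotFound, which is the situation the unreachable! in run_child reports.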

Now for the parent process. Let's go through it function by function. First we wait for the child process to be ready (before it starts executing the testee); this shows up as Stopped(Pid(...), SIGTRAP), since the child asked to be traced. Then we set up exactly what we want to trace and, with a single trace_syscall call, let the tracee run to the next syscall. Finally, in a loop, we wait for a signal from the child and tell it to continue.

fn run_parent(pid: Pid) {
    // wait for our child process to be ready
    let ws = wait().expect("Parent failed waiting for child");
    info!("Child process ready with signal: {ws:?}, will ask it to continue untill syscall");

    setup_tracing(pid).expect("Parent failed to set up tracing");
    trace_syscall(pid, None).expect("Parent failed tracing");

    let mut msync_counter = 0;
    loop {
        match wait_for_signal(&mut msync_counter) {
            Ok(_) => { /* nop */ }
            Err(e) => {
                match e {
                    //...
                }
                break;
            }
        }
    }
}

Note that we break out of the loop on error. Further, we can ignore the setup_tracing function for now.

Waiting for signals

Here’s the first version of the wait function, called in a loop. In case the child stops, we will handle it accordingly. Should waiting return an error, we exit the loop. In all other cases, we log something but continue.

fn wait_for_signal(msync_counter: &mut i32) -> Result<(), HavocError> {
    match wait() {
        Ok(WaitStatus::Stopped(pid_t, sig_num)) => {
            handle_child_stopped(sig_num, pid_t, msync_counter)
        }

        Ok(WaitStatus::Exited(pid, exit_status)) => {
            debug!("Child with pid: {} exited with status {}", pid, exit_status);
            Ok(())
        }

        Ok(status) => {
            warn!("Received unhandled wait status: {:?}", status);
            Ok(())
        }

        Err(err) => {
            error!("An error occurred: {:?}", err);
            Err(HavocError::Wait)
        }
    }
}

Now what to do if the child stopped? We tell it to continue with trace_syscall. We pass the signal to the tracee in all cases but SIGTRAP.

fn handle_child_stopped(
    sig_num: Signal,
    pid_t: Pid,
    msync_counter: &mut i32,
) -> Result<(), HavocError> {
    match sig_num {
        Signal::SIGTRAP => {
            handle_sigtrap(pid_t, msync_counter)?;
            trace_syscall(pid_t, None)
        }
        Signal::SIGSTOP => trace_syscall(pid_t, Some(Signal::SIGSTOP)),
        // ... some corner cases like SIGWINCH
        _ => trace_syscall(pid_t, Some(sig_num)),
    }
}

JVM… The JVM requires us to handle SIGSEGV here: It expects to receive a segmentation violation on null dereference to update garbage collected pointers. Interesting…

Now we have the first interesting functionality: we want to find out what caused the SIGTRAP. If it was our sought-after msync syscall, we handle it. Note that we only log the entry and handle the call on exit. This means the msync call actually reaches the kernel.

fn handle_sigtrap(pid_t: Pid, msync_counter: &mut i32) -> Result<(), HavocError> {
    let regs = ptrace::getregs(pid_t).map_err(HavocError::Register)?;
    if regs.orig_rax == SYS_msync as u64 {
        if regs.rax == -ENOSYS as u64 {
            info!("Entry of syscall in {pid_t} : {}", regs.orig_rax);
        } else {
            info!("Exit of syscall in {pid_t} : {}", regs.orig_rax);
            handle_msync(msync_counter, regs, pid_t)?;
        }
    } 
    Ok(())
}

Here, regs are the registers of the tracee. How cool is that! Just imagine what terrible things we could do with such power… Anyhow, orig_rax (the value of rax before it is overwritten with the return value) holds the syscall number, so for msync it equals SYS_msync, which is 26 on x86_64. Note that this function is called for each and every syscall, so we waste quite a bit of performance, but that's fine for now.

To distinguish entry from return of the call (we will be woken up for both), we can use the actual rax, which is set to -ENOSYS on entry, according to the ptrace manpage (they advise against it but who will stop me?).
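For comparison, the more conventional alternative to peeking at rax is to keep track of the stops yourself: per traced thread, syscall stops alternate between entry and exit. The following is only a sketch of that bookkeeping, not what havoc does. StopTracker and its methods are hypothetical names, and it assumes it is called for syscall stops only and that tracing starts while the thread is outside a syscall.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyscallStop {
    Entry,
    Exit,
}

/// Remembers, per traced thread id, whether that thread is currently
/// inside a syscall. (Hypothetical helper, not part of havoc.)
#[derive(Default)]
struct StopTracker {
    in_syscall: HashMap<i32, bool>,
}

impl StopTracker {
    /// Call exactly once for every syscall stop reported for `tid`.
    /// The first stop of a thread is treated as an entry, the next one as
    /// the matching exit, and so on.
    fn on_syscall_stop(&mut self, tid: i32) -> SyscallStop {
        let inside = self.in_syscall.entry(tid).or_insert(false);
        *inside = !*inside;
        if *inside {
            SyscallStop::Entry
        } else {
            SyscallStop::Exit
        }
    }

    /// Drop the state of a thread that has exited, so a reused id starts fresh.
    fn forget(&mut self, tid: i32) {
        self.in_syscall.remove(&tid);
    }
}
```

The price is extra state in the tracer. The benefit is that it does not depend on what happens to be in rax at the time of the stop.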

So now that we can filter out msync calls, let’s mess them up!

Messing up msync

fn handle_msync(
    msync_counter: &mut i32,
    mut regs: nix::libc::user_regs_struct,
    pid: Pid,
) -> Result<(), HavocError> {
    let addr = regs.rdi;

    let proc = Process::new(pid.as_raw()).map_err(HavocError::Proc)?;
    let mappings = proc.maps().map_err(HavocError::Proc)?;

    // Ranges in /proc/<pid>/maps are half-open: the end address is exclusive.
    let map = mappings
        .iter()
        .find(|m| m.address.0 <= addr && addr < m.address.1);

    match map {
        Some(map) => match &map.pathname {
            procfs::process::MMapPath::Path(p) => {
                info!("Found map: {:?}", p);
            }
            e => warn!("Did not implement path type: {:?}", e),
        },
        None => todo!(),
    }

    // see also  https://github.com/strace/strace/blob/master/src/linux/x86_64/set_error.c
    regs.rax = -(Errno::ENOANO as i32) as u64;
    ptrace::setregs(pid, regs).unwrap();
    Ok(())
}

So here’s quite a bit to unpack:

We use the value from rdi, which contains the address argument of the msync call.
We then use the procfs crate to fetch the memory mappings of the tracee.
Of all mappings, we select the one that contains the address we got earlier.
This mapping is then logged, but we could also decide whether to mess with that specific file or if it’s ok.
We then set rax to contain the desired error, in this case ENOANO. That one won’t occur in the wild, so we can be sure that if we see it, we caused it.
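The lookup steps above can also be sketched without the procfs crate, which makes it easier to see what is going on. This is a simplified illustration, not the code used in havoc: Mapping, parse_maps_line, find_mapping and should_fail are hypothetical names, the parser ignores paths containing spaces, and the sample addresses in the usage note are made up.

```rust
/// One line of /proc/<pid>/maps, reduced to what is needed here.
#[derive(Debug, PartialEq)]
struct Mapping {
    start: u64,
    /// Exclusive end address of the mapping.
    end: u64,
    /// Backing file or pseudo-path such as "[stack]"; None for anonymous mappings.
    path: Option<String>,
}

/// Parse a line of the form `start-end perms offset dev inode [pathname]`.
/// Simplification: a pathname containing spaces is cut at the first space.
fn parse_maps_line(line: &str) -> Option<Mapping> {
    let mut fields = line.split_whitespace();
    let (start, end) = fields.next()?.split_once('-')?;
    let start = u64::from_str_radix(start, 16).ok()?;
    let end = u64::from_str_radix(end, 16).ok()?;
    // Skip perms, offset and dev; fail if the inode field is missing.
    fields.nth(3)?;
    let path = fields.next().map(str::to_owned);
    Some(Mapping { start, end, path })
}

/// Find the mapping that contains `addr` (start inclusive, end exclusive).
fn find_mapping(maps: &str, addr: u64) -> Option<Mapping> {
    maps.lines()
        .filter_map(parse_maps_line)
        .find(|m| m.start <= addr && addr < m.end)
}

/// Decide whether msync on this mapping should be made to fail.
/// Here the decision is a plain suffix match on the file path.
fn should_fail(mapping: &Mapping, suffix: &str) -> bool {
    mapping
        .path
        .as_deref()
        .map_or(false, |p| p.ends_with(suffix))
}
```

In the tracer, the text would come from reading /proc/<pid>/maps of the tracee and addr from rdi. should_fail(&mapping, "file.bin") would then gate the write to rax.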

A first test

Running the code as-is works fine.

➜  cargo run --release --bin havoc
   Compiling havoc v0.1.0 (.../havoc)
    Finished `release` profile [optimized + debuginfo] target(s) in 0.43s
     Running `target/release/havoc`
[2024-08-23T10:57:37Z INFO  havoc] havoc started with pid 1868587
[2024-08-23T10:57:37Z INFO  havoc] havoc child process executing as 1868632
[2024-08-23T10:57:37Z INFO  havoc] Child process ready with signal: Stopped(Pid(1868632), SIGTRAP), will ask it to continue untill syscall
[2024-08-23T10:57:37Z INFO  testee] Testee running with pid 1868632
18446744073709551578
[2024-08-23T10:57:37Z INFO  havoc] Entry of syscall in 1868632 : 26
[2024-08-23T10:57:37Z INFO  havoc] Exit of syscall in 1868632 : 26
[2024-08-23T10:57:37Z INFO  havoc] Detected msync # 1
[2024-08-23T10:57:37Z INFO  havoc] Found map: "..../file.bin"
[2024-08-23T10:57:37Z ERROR testee] Mmap failed: Could not flush file

    Caused by:
        No anode (os error 55)

Nice, this works.
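A side note on the bare number 18446744073709551578 in the output above: it is exactly what -ENOSYS looks like when it is read as an unsigned 64-bit register value (2^64 - 38), so it looks like a leftover debug print of that constant. The sketch below shows the encoding in both directions and how the testee side could recognise the injected error. The constants carry the Linux errno numbers, and the function names are made up for this illustration.

```rust
use std::io;

/// errno that Linux reports as "No anode"; the one injected in this post.
const ENOANO: i32 = 55;
/// errno found in rax at syscall entry on x86_64.
const ENOSYS: i32 = 38;

/// Encode an errno the way a failed syscall reports it in rax:
/// as a negative number, reinterpreted as an unsigned 64-bit value.
fn errno_to_rax(errno: i32) -> u64 {
    (-(errno as i64)) as u64
}

/// Decode rax after a syscall. By Linux convention, values in the range
/// -4095..=-1 are errors; everything else is a regular return value.
fn rax_to_errno(rax: u64) -> Option<i32> {
    let value = rax as i64;
    if (-4095..0).contains(&value) {
        Some((-value) as i32)
    } else {
        None
    }
}

/// True if an I/O error seen by the testee carries the injected errno.
fn is_injected(err: &io::Error) -> bool {
    err.raw_os_error() == Some(ENOANO)
}
```

errno_to_rax(ENOSYS) gives the number from the log, and errno_to_rax(ENOANO) is the value handle_msync writes into rax.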

However, when I tried it with the actual testee written in Java, it just kept running without stopping at all. It turns out that ptrace works on a per-thread basis, so we need to trace child threads and processes of our testee as well.

Fork, clone, and friends

In the repo, there is a test program called “forker”, that forks and then also spawns a thread in which the actual action happens. To trace that, a bit more code was needed:

/// Set up the ptrace options to also trace fork, clone and vfork
fn setup_tracing(pid: Pid) -> Result<(), HavocError> {
    ptrace::setoptions(
        pid,
        Options::PTRACE_O_TRACEFORK
            .union(Options::PTRACE_O_TRACECLONE)
            .union(Options::PTRACE_O_TRACEVFORK),
    )
    .context("Could not set options to follow forks")
    .map_err(HavocError::Ptrace)
}

This tells ptrace that we want to trace children of the tracee too. Now one of the earlier “edge case” omissions becomes important: what happens if a child exits? In my first version, this caused the code to terminate. After all, why continue tracing if the tracee exited? After fiddling with this corner case for some time, I decided it does make sense to keep tracing the rest of them regardless. Further, since a child spawned by the tracee can also spawn children, we need to tell ptrace to track those as well.

This makes the wait_for_signal code a bit more complex. Note that the Exited match arm does not call trace_syscall again.

Ok(WaitStatus::Exited(pid, exit_status)) => {
    debug!("Child with pid: {} exited with status {}", pid, exit_status);
    Ok(())
}

Ok(WaitStatus::PtraceEvent(pid, Signal::SIGTRAP, _)) => {
    debug!("PtraceEvent - SIGTRAP for: {pid} ");
    setup_tracing(pid)?;
    trace_syscall(pid, Some(Signal::SIGTRAP))?;
    Ok(())
}

Now the forker code is also correctly traced. And now it does also work for the JVM!
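To see where the PtraceEvent variant comes from, it helps to look at the raw status word that wait returns on Linux. The sketch below decodes it by hand. It is an illustration of the encoding, not a replacement for the WaitStatus type from nix; the Status enum and decode_wait_status are hypothetical names, and syscall stops marked via PTRACE_O_TRACESYSGOOD are not treated specially.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Status {
    Exited { code: i32 },
    Signaled { signal: i32 },
    Stopped { signal: i32 },
    PtraceEvent { signal: i32, event: i32 },
    Continued,
}

// Event numbers as defined by Linux for the options used in setup_tracing.
const PTRACE_EVENT_FORK: i32 = 1;
const PTRACE_EVENT_VFORK: i32 = 2;
const PTRACE_EVENT_CLONE: i32 = 3;

/// Decode a raw Linux wait status word.
fn decode_wait_status(status: i32) -> Status {
    if status == 0xffff {
        return Status::Continued;
    }
    if status & 0xff == 0x7f {
        // Stopped: the stop signal sits in bits 8..16 and, for ptrace
        // events, the event number in the bits above that.
        let signal = (status >> 8) & 0xff;
        let event = (status >> 16) & 0xffff;
        if event != 0 {
            Status::PtraceEvent { signal, event }
        } else {
            Status::Stopped { signal }
        }
    } else if status & 0x7f == 0 {
        // Normal exit: the exit code sits in bits 8..16.
        Status::Exited { code: (status >> 8) & 0xff }
    } else {
        // Killed by a signal: the signal number sits in the low 7 bits.
        Status::Signaled { signal: status & 0x7f }
    }
}
```

A plain SIGTRAP stop is 0x057f, while a fork event arrives as 0x1057f: the same SIGTRAP stop with PTRACE_EVENT_FORK in the upper bits. That is the difference between the Stopped and PtraceEvent match arms.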

Findings

So what did I find? The test code actually ignored msync failures. This would be quite bad in production.

It is worth noting, though, that the JVM does pass these errors through correctly:

Exception in thread "main" java.io.UncheckedIOException: java.io.IOException: No anode (msync with parameter MS_SYNC failed)
	at java.base/java.nio.MappedMemoryUtils.force(MappedMemoryUtils.java:102)
	at java.base/java.nio.Buffer$1.force(Buffer.java:839)
	at java.base/jdk.internal.misc.ScopedMemoryAccess.forceInternal(ScopedMemoryAccess.java:337)
	at java.base/jdk.internal.misc.ScopedMemoryAccess.force(ScopedMemoryAccess.java:325)
	at java.base/java.nio.MappedByteBuffer.force(MappedByteBuffer.java:309)
	at java.base/java.nio.MappedByteBuffer.force(MappedByteBuffer.java:250)
	at dev.amann.msync_test.App.main(App.java:19)
Caused by: java.io.IOException: No anode (msync with parameter MS_SYNC failed)
	at java.base/java.nio.MappedMemoryUtils.force0(Native Method)
	at java.base/java.nio.MappedMemoryUtils.force(MappedMemoryUtils.java:100)
	... 6 more

Thanks for reading! If you want to know more about what the ENOANO error code was originally intended for, continue here.

Cover photo by Fabrizio Conti on Unsplash
