Microkernel IPC Without the Performance Tax

June 22, 2026

Mach was 2× slower than monolithic Unix. The overhead came from context switches, data copies, and lock contention on shared channels. We built an IPC path that avoids all three. The result: 150-cycle inter-process communication on the data plane, with proof-carrying security guarantees on every descriptor.

The microkernel promise is seductive: a minimal kernel that manages only address spaces and IPC, with every service—filesystems, networking, device drivers—running as an unprivileged user-space process. Fault isolation, clean interfaces, no sprawling kernel monster. The academic literature has called this the right architecture for decades.

The microkernel reality, historically, has been disappointing. Mach, the most famous example, was consistently 1.5–2× slower than monolithic Unix on the same hardware. The overhead was not a fixed cost you could optimize away once. It was structural: every operation that crossed from one service to another required a context switch, a data copy through kernel memory, and synchronization on shared channel state. Each of these is expensive on its own. Together, they made microkernel IPC untenable for performance-sensitive workloads.

KeuOS is a Salt-based microkernel that does not accept this trade. The IPC path is built on a three-optimization stack that eliminates each of the three costs independently, and the combination brings the data plane to within 150 cycles—competitive with L4 and within arm's reach of a direct function call.

The three costs

Understanding why traditional microkernel IPC is expensive means understanding three distinct costs that compound on every operation.

Context switch. Crossing from Ring 3 to Ring 0 and back costs roughly 1,000 cycles on modern x86-64 hardware. The syscall/sysret pair itself is tight (~80 cycles), but the kernel must save and restore registers, switch page tables, flush TLB entries, validate the request, route it to the destination, and schedule the recipient. By the time two sysret instructions have executed and the data has arrived, a thousand cycles have passed.

Data copy. Even with a shared buffer, traditional IPC copies data at least once: the sender writes into kernel memory, the kernel reads and rewrites into the receiver's address space. memcpy of a 4 KiB buffer is ~200 cycles on modern silicon. The more data you move, the more this dominates.

Lock contention. Shared IPC channels require synchronization. A spinlock or mutex around the shared ring means every send and receive touches an atomic instruction that bounces the cache line between cores. Under contention, the lock cmpxchg failure path costs hundreds of cycles in retry and coherence traffic.

These three costs are additive. A traditional microkernel IPC that context-switches twice, copies 4 KiB through a kernel bounce buffer, and acquires a shared lock is spending 1,000 + 200 + 200 = 1,400 cycles before the receiver has even looked at the data. For a message of a few hundred bytes, the overhead swamps the work.

No trap, no copy, no lock

The KeuOS IPC design eliminates these three costs by refusing the architectural assumptions that cause them. Each elimination is independent; together, they form a coherent whole.

No trap. The data plane runs through single-producer, single-consumer (SPSC) rings mapped into the sender and receiver address spaces. A send is a store instruction. A receive is a load instruction. There is no system call on the data path. The kernel participates only during setup and teardown: it allocates the ring page, maps it into both address spaces, and establishes the producer-consumer relationship. After that, the hardware does the work.

The ring is implemented as a fixed-size byte array with two indices: head (written by the producer, read by the consumer) and tail (written by the consumer, read by the producer). The producer writes data at the head offset and advances head. The consumer reads from tail and advances tail. The indices are plain integers, not atomics. Because there is exactly one writer and one reader, there is no race.

struct SpscRing {
    buf: [u8; 4096],      // ring buffer, 4 KiB
    head: u16 addr(0x1000),  // producer writes here
    tail: u16 addr(0x2000),  // consumer writes here
}

fn produce(ring: &mut SpscRing, data: &[u8])
    requires(data.len() <= 0x1000)
{
    let h = ring.head;
    let wrapped = h + data.len();
    let avail = ring.tail + 0x1000 - h;
    requires(avail >= data.len());   // space available
    memcpy(ring.buf[h..wrapped], data);
    ring.head = wrapped & 0xFFF;       // no sync needed
}

The compiler proves the requires clauses statically in most cases. Where it cannot, a runtime assertion catches overflow. Either way, the data path carries no branches from the kernel.

No copy. The ring page is registered with the IOMMU. When a network device writes a received packet, it writes directly into the ring page via DMA. The producer (in this case, the NIC) places the data at the head offset in main memory, and the consumer reads from the same physical page. Nothing moves the bytes a second time. The ring is cache-hot for both sides because head and tail advance predictably.

For inter-process messages that do not originate from hardware—say, a filesystem server sending a directory listing to a client—the sender writes directly into the consumer-mapped page. The receiver reads the data from the same cache line the sender just wrote. On the same core (or a core sharing an L2), the data arrives in L1 before the producer instruction retires.

No lock. Because the channel is SPSC by construction, there is no contention. Head is written by the producer and read by the consumer. Tail is written by the consumer and read by the producer. They occupy separate cache lines (the struct places them at offsets 0x1000 and 0x2000, 4 KiB apart). A producer that writes head does not invalidate the consumer's cache line containing tail. A consumer that writes tail does not invalidate the producer's cache line containing head. The two pointers can advance in parallel on different cores with zero coherence traffic for the index variables.

The combination of these three—no trap, no copy, no lock—produces a data path that is structurally indistinguishable from a memory-to-memory copy in the same address space. The overhead is the cost of the loads and stores themselves, plus a single branch for the ring-full check.

Proof-carrying IPC

Eliminating the kernel from the data path creates a security problem. In a traditional microkernel, every message passes through kernel memory, where the kernel can validate descriptors, check capabilities, and enforce access control. If the kernel never touches the data, how do you prevent a malicious sender from forging a descriptor that reads kernel memory?

KeuOS answers this with three mechanisms that together form a proof-carrying IPC layer: the sender provides a cryptographic proof that the kernel can verify without touching the ring data.

SipHash-2-4 proof hints. Each IPC descriptor carries a 64-bit authentication tag computed over the descriptor fields using SipHash-2-4 with a per-channel key established during ring setup. The tag is embedded in the descriptor by the compiler at the call site—it is not computed at runtime. The kernel verifies the tag when the sender signals a new message (the one trap that remains in the IPC path: a single syscall to notify the receiver). Tag verification is ~15 cycles on modern x86-64 using the hardware CRC extension that SipHash can leverage. A mismatched tag means the kernel drops the message and kills the sender.

Because the key is per-channel and established during setup, a compromised sender on ring A cannot forge descriptors for ring B. The tag is not a signature in the cryptographic sense—there is no asymmetric key, no PKI—but it is sufficient for an in-kernel verifier that checks the tag before routing the notification.

Alignment gate. Every ring is aligned to a 64-byte cache line boundary. More importantly, the control structures (head, tail, and the capability table) are placed on separate cache lines that fall in the kernel's protected page. The data portion of the ring is mapped into userspace; the control portion is not. A sender cannot modify head or tail values arbitrarily because those fields live in kernel memory. The only thing a sender can write is payload bytes into the data region, and those bytes cannot change the ring's invariants.

MMU gate. The Ring 3 / Ring 0 boundary is absolute. The SPSC ring's data page is mapped into userspace with read-write permission, but the page table entry is marked non-executable and the surrounding pages are unmapped. A userspace process cannot jump into the ring, cannot read kernel memory by corrupting pointers in the ring, and cannot escalate privileges through the IPC path. The proof is enforced by the MMU, not by software. A kernel bug in the IPC setup might map the wrong page, but the MMU gate makes privilege escalation through the data plane impossible regardless of bugs in the sender or receiver logic.

These three gates—SipHash, alignment, MMU—cost roughly 25 cycles on the notification path (the one syscall per message). The data plane, which carries the bulk of the work, remains at the 150-cycle baseline.

Benchmark: 150 cycles

We measured round-trip IPC latency on x86-64 (Intel i7-13700K, 5.4 GHz boost, mitigations enabled, one core dedicated to each process, direct message of 64 bytes):

ConfigurationCycles (median)Vs. KeuOS
Direct function call (same process)12
KeuOS SPSC (data plane only)15012.5×
KeuOS with proof verification17514.6×
Linux pipe (2 processes)61050.8×
Linux eventfd + signalfd1,04086.7×
Linux AF_UNIX SOCK_DGRAM1,280106.7×
Mach IPC (published, 1989 era)~2,000~167×

The data plane is 150 cycles: two stores (producer writes payload and advances head), one load (consumer reads payload), one store (consumer advances tail), and a branch for the availability check that is almost always predicted correctly. The proof verification adds 25 cycles for the SipHash tag check and the kernel's notification routing. The MMU and alignment gates add zero runtime cost—they are enforced by the hardware page tables and the kernel's page-level permissions, which are already in place.

At 150 cycles for the data plane and 175 cycles with proof verification, KeuOS IPC is competitive with L4 (which reports ~150–200 cycles for a direct message) and within an order of magnitude of a function call. The overhead relative to a direct function call is the cost of an indirect branch (the ring-full check) and the memory ordering implied by the producer-consumer contract—which on x86 is essentially free for aligned stores and loads.

The key insight is not any single optimization. Each is known individually—SPSC rings, DMA into application buffers, cache-line separation of producer and consumer state. What is new is the combination: three independent mechanisms that together eliminate all three structural costs of microkernel IPC, plus proof-carrying descriptors that keep the security properties without putting the kernel on the data path.

NetD: A userspace networking daemon

The first and most demanding client of this IPC path is NetD, the KeuOS networking daemon. NetD runs entirely in userspace, as a Ring 3 process. It owns the NIC device registers (mapped into its address space by the kernel at boot) and runs a full TCP/IP stack in a single thread with an event loop.

Application processes communicate with NetD over SPSC rings. A process that wants to open a TCP connection sends a connect descriptor through the ring. The kernel verifies the SipHash tag and routes the notification to NetD's event loop. NetD processes the TCP handshake, and when the connection is established, it writes incoming packet data directly into the application's receive ring via DMA—no copy, no kernel intervention. The application reads the data from the ring in its own address space, at its own pace.

The result is that network I/O in KeuOS follows the same fast path as local IPC. A recv call that would traverse the kernel's network stack, copy data twice, and wake the receiving process in a traditional microkernel instead resolves to a load from an SPSC ring that is already in L1 cache. The kernel is involved only to route the notification when new data arrives:

OperationTraditional microkernelKeuOS (via NetD)
Packet arrivesNIC → kernel bufferNIC → application ring (DMA)
NotificationKernel schedules network stackKernel routes notification
Data deliveryKernel → network stack → applicationApplication reads ring (load instruction)
syscalls per 1,000 packets~2,000 (send + recv per message)~1,000 (notification only)
Total cycles per packet (avg)~3,500~450

NetD is not just a proof of concept. It handles ARP, IPv4, TCP connection management, and a minimal UDP echo service. The TCP state machine is verified with Z3 contracts on the control flow. The SPSC ring interfaces carry the same proof-carrying descriptors described above. The entire daemon compiles to under 8,000 instructions.

The TCP stack is being actively extended: SYN cookies are defined, the dispatch layer handles connect/send/recv/close, and the keepalive path is wired. We expect to reach a feature-complete TCP server by the end of the quarter.

What remains

The IPC path as described covers the data plane and notification. Two areas are still in progress.

Control-plane operations. Setting up a ring, registering it with the IOMMU, and establishing the SipHash key all require kernel mediation. These operations are infrequent (setup once per channel) and correctly go through the kernel's capability system. The cost of a few dozen syscalls at setup time is irrelevant to the performance of the data plane.

Multi-core wakeup. When NetD has new data to deliver, it must wake the receiving process if it is sleeping. The current mechanism uses an inter-processor interrupt (IPI) that costs ~1 microsecond. For latency-sensitive workloads, a polling mode is under evaluation where the receiver busy-waits on the ring tail, avoiding the IPI entirely. Polling increases CPU usage but brings the round-trip down to the 150-cycle data-plane cost with no notification overhead. The choice between poll and wake is per-channel and configurable.

The design is open source. The kernel boots in QEMU on arm64 and x86-64. The SPSC ring implementation lives in kernel/core/ and the NetD daemon in netd/. Both are written in Salt and carry Z3-verified contracts on their safety-critical paths.

git clone https://github.com/bneb/lattice.git
cd lattice
make setup               # Fetch toolchain and dependencies
make run-qemu            # Boot KeuOS in QEMU
# At the prompt: `spawn netd`

The syscall ABI documentation covers ring setup and notification in detail. The ring implementation is under 400 lines. Everything is at github.com/bneb/lattice.

Read next

Compile-time verification with negative overhead
Adding a Z3 theorem prover made the compiler faster, not slower.

Choosing arenas over borrow checking
Salt has no lifetime annotations. Arena allocation guarantees safety without borrow checking.

Questions or corrections: open an issue.