The End of the Line
Ring buffers are the workhorses of streaming data. You find them in audio engines, network stacks, logging systems, and message queues. They are brilliant for producer-consumer scenarios because they use a fixed block of memory to pretend they are infinite.
But they have a problem. The “end of the line.”
The Two-Copy Problem
Everything works beautifully until your write pointer hits the end of the buffer. When you need to write a chunk of data that crosses the boundary, you can’t just copy it. You have to split it.
You have to calculate how much space is left at the end, memcpy that part, then calculate the remainder and memcpy it to the start of the buffer.
void WriteToBuffer(RingBuffer* rb, const void* data, size_t size) {
    // ... space check omitted ...
    size_t current_head = rb->head;
    if (current_head + size <= rb->capacity) {
        // Case 1: The write fits in one contiguous block.
        memcpy(rb->buffer + current_head, data, size);
    } else {
        // Case 2: The write wraps around. It must be split.
        size_t first_chunk_size = rb->capacity - current_head;
        memcpy(rb->buffer + current_head, data, first_chunk_size);
        size_t second_chunk_size = size - first_chunk_size;
        memcpy(rb->buffer, (char*)data + first_chunk_size, second_chunk_size);
    }
    // Finally, update the head pointer, wrapping it around with modulo.
    rb->head = (current_head + size) % rb->capacity;
}
This conditional logic kills performance. That if statement is a prime candidate for branch misprediction, which stalls the CPU. It also makes the API messy. You can’t just ask for a pointer to write into because the memory might not be contiguous.
What if we could eliminate the cliff? What if the buffer was actually infinite?
The Virtual Memory Trick
We can cheat. We can use our knowledge of virtual memory to create a new reality.
The trick is to ask the OS to map the same block of physical RAM to two contiguous regions of virtual address space.
This creates a virtual buffer where the second half is a perfect, byte-for-byte mirror of the first half. Writing to buffer[0] also writes to buffer[capacity] because they are literally the same physical memory.
This is most practical in 64-bit applications where virtual address space is plentiful. The only catch is that the buffer size has to align with the OS page granularity.
Seeing the Magic in Action
With this setup, the two-copy problem vanishes. You can perform a single memcpy that seamlessly wraps around the buffer boundary. No special logic required.
Here is a demonstration. We write a string that is intentionally placed to cross the boundary of the physical memory.
const size_t requested_size = 65536;
char* buffer = CreateMirroredRingBuffer(requested_size);
const char message[] = "HELLO!";
size_t message_len = sizeof(message) - 1;
size_t start_index = requested_size - 3; // Position write to force a wrap-around.
// This single memcpy will cross the boundary.
memcpy(&buffer[start_index], message, message_len);
// We can now read the entire, contiguous message by pointing
// to the start of the write in our "virtual" buffer.
printf("Contiguous message after wrap-around: %.*s\n",
       (int)message_len, &buffer[start_index]);
The output proves that the single memcpy worked as if the buffer were truly linear. The memory region starting at start_index contains the full, unbroken message. The hardware handled the wrap-around for us.
Contiguous message after wrap-around: HELLO!
Now let’s see how to build this on different platforms.
Building on Windows
The modern Windows implementation uses VirtualAlloc2 and MapViewOfFile3. These allow us to create and map into memory “placeholders” without race conditions.
Step 1: Align Size to Granularity
On Windows, the buffer size must be a multiple of the system’s allocation granularity (usually 64KB). We get this from GetSystemInfo and round up.
SYSTEM_INFO sysInfo;
GetSystemInfo(&sysInfo);
const size_t granularity = sysInfo.dwAllocationGranularity;
// Round the requested size up to the nearest multiple of granularity.
size_t aligned_size = (size + granularity - 1) & ~(granularity - 1);
Step 2: Create a Page-File-Backed Memory Section
We use CreateFileMapping to ask the OS for a chunk of committable memory. Since this is an older Win32 API, we have to split our 64-bit size into two 32-bit arguments.
// Split the 64-bit size for the CreateFileMapping function.
DWORD size_high = (DWORD)(aligned_size >> 32);
DWORD size_low = (DWORD)(aligned_size & 0xFFFFFFFF);
HANDLE mapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                   PAGE_READWRITE, size_high, size_low, NULL);
Step 3: Reserve a Combined Virtual Address Placeholder
We reserve a single, contiguous virtual address range that is twice the size of our aligned buffer. This is our canvas.
void* placeholder = VirtualAlloc2(GetCurrentProcess(), NULL, 2 * aligned_size,
                                  MEM_RESERVE | MEM_RESERVE_PLACEHOLDER,
                                  PAGE_NOACCESS, NULL, 0);
Step 4: Split the Placeholder
This is the tricky part. MapViewOfFile3 requires the placeholder to be the exact same size as the view being mapped. We can’t map a size-byte view into a 2*size-byte placeholder.
So we split it. We call VirtualFree on the second half with MEM_PRESERVE_PLACEHOLDER. This tells the OS to chop our large placeholder into two smaller, independent ones.
// Split the 2*size placeholder into two separate 'size' placeholders.
VirtualFree((char*)placeholder + aligned_size, aligned_size,
            MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER);
Step 5: Map the Memory Section into Each Placeholder
Now we have two correctly-sized placeholders. We map our memory section into each one using MapViewOfFile3 with MEM_REPLACE_PLACEHOLDER. This atomically swaps the placeholder for our active memory.
// Map the first half into the first placeholder.
void* view1 = MapViewOfFile3(mapping, GetCurrentProcess(), placeholder,
                             0, aligned_size, MEM_REPLACE_PLACEHOLDER,
                             PAGE_READWRITE, NULL, 0);
// Map the second half into the second placeholder.
void* view2 = MapViewOfFile3(mapping, GetCurrentProcess(),
                             (char*)placeholder + aligned_size,
                             0, aligned_size, MEM_REPLACE_PLACEHOLDER,
                             PAGE_READWRITE, NULL, 0);
If both calls succeed, we have our mirrored buffer. You can close the mapping handle now since the views keep the memory alive.
Note on Older Windows:
VirtualAlloc2 and MapViewOfFile3 are relatively new. On older systems, you have to reserve a 2*size block, VirtualFree it, and then race to MapViewOfFileEx both halves before another thread steals the address space. Robust code needs a retry loop for that.
Building on POSIX (Linux/macOS)
The POSIX implementation uses mmap. We need a sharable file descriptor to make this work.
Step 1: Round Up to Page Size
POSIX mmap requires the size to be a multiple of the system page size (usually 4KB). We get this from sysconf.
long page_size = sysconf(_SC_PAGESIZE);
// Round the requested size up to the nearest multiple of the page size.
size_t aligned_size = (size + page_size - 1) & ~((size_t)page_size - 1);
Step 2: Get a Sharable File Descriptor
We need a file descriptor (fd) for the physical memory. On Linux, memfd_create is perfect because it creates an anonymous in-memory file. On macOS, we fall back to shm_open.
shm_open needs a unique name, so we generate one with the process ID and a counter, retrying if it collides.
int fd = -1;
#if __linux__
// memfd_create is preferred: no name collisions and no filesystem presence.
fd = syscall(SYS_memfd_create, "ring-buffer", MFD_CLOEXEC);
#endif
if (fd == -1) {
    // shm_open is the POSIX fallback for macOS and older Linux.
    char path[256];
    int retries = 100;
    do {
        // Generate a unique name.
        snprintf(path, sizeof(path), "/ring-buffer-%d-%d", getpid(), retries);
        fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
        retries--;
    } while (fd < 0 && errno == EEXIST && retries > 0);
    if (fd < 0) return nullptr;
    // Immediately unlink the path. The memory object will persist
    // until the last fd is closed, ensuring automatic cleanup.
    shm_unlink(path);
}
// Set the size of the memory object.
ftruncate(fd, (off_t)aligned_size);
Step 3: Reserve a Contiguous Virtual Address Range
We need two contiguous virtual memory blocks. The safest way is to reserve a single block that is twice the size. We use mmap with PROT_NONE to reserve the address range without allocating physical memory.
void* placeholder = mmap(nullptr, 2 * aligned_size, PROT_NONE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
Step 4: Map the File Descriptor Twice
Now we map our file descriptor into the two halves of the placeholder. We use MAP_FIXED to force mmap to use our reserved addresses. Crucially, we use MAP_SHARED so that writes to one mapping show up in the other.
// Map the first half.
void* view1 = mmap(placeholder, aligned_size, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_SHARED, fd, 0);
// Map the second half.
void* view2 = mmap((char*)placeholder + aligned_size, aligned_size,
                   PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_SHARED, fd, 0);
If both succeed, you can close the fd.
POSIX Gotchas:
- shm_open Naming: Since shm_open uses global names, collisions are possible. The retry loop helps, but production code might want a stronger random name generator.
- MAP_SHARED is Mandatory: You must use MAP_SHARED. MAP_PRIVATE would create copy-on-write pages, breaking the mirror effect.
Taking it Further
This isn’t just a parlor trick. It’s a building block for high-performance systems. Once you eliminate the wrap-around logic, you can build some incredible things:
- Lock-Free Queues: A perfect foundation for SPSC (Single Producer Single Consumer) queues. You can write data of any size without complex boundary checks.
- Real-Time Audio: Audio processing often needs a sliding window of samples. A mirrored buffer lets you pass a simple pointer to your DSP filters, even if the window wraps around the buffer end.
- Network Packet Assembly: Write incoming fragments directly into the buffer. Once a full packet arrives, you have it in contiguous memory, ready to process without an extra copy.
- Command Buffers: Your renderer can generate commands in a tight loop, writing them straight to the buffer for the GPU.
Conclusion
We took a common problem, the ring buffer wrap-around, and we fixed it. Not with more code, but with better memory management.
We traded some virtual address space for a branch-free hot path. In streaming applications where every cycle counts, that is a fantastic bargain.
This is what data-oriented design is really about. It’s not just about cache lines. It’s about understanding the hardware and the OS well enough to make them do the heavy lifting for you.