'Broadcast' pipes on Linux
A little while ago, I wrote a couple of Fediverse posts about the idea of a 'broadcast' pipe, and subsequently about a design for implementing one without coordinating a list of recipients. (Those are Misskey links, so unless you have a high-powered machine, I recommend copy-pasting the URLs into another Fediverse server's search box to load them there instead of opening them directly...) At the time, I said I would write a blog post about the topic maybe the next day. Well, that didn't happen then, but now it is happening. In this post I'll explain what a broadcast pipe is, the use case for such a thing, my solution, and the weaknesses in it that I'm aware of.
I didn't actually write or test code implementing this, so there could be a problem in the design that I've missed, but as far as I can tell it should work in theory.
The problem
A pipe, on Unix, is a one-way stream of bytes between two processes. One end of the pipe acts like a write-only file, and the other end like a read-only file. Anything written to the write end can later be read from the read end. This provides a simple but powerful inter-process communication mechanism that Unix uses mainly to implement pipelines, which are chains of processes where the output of each is directed to the input of the next. Another kind of pipe is called a FIFO file, or named pipe. This is a file in the filesystem, but when opened it provides you with one end or the other of a persistent pipe (depending on whether you open it for reading or for writing). A FIFO is usually used to provide a way of sending commands to a process, a queue of data to be processed, or other similar constructions.
What I want is the ability to use FIFOs, or pipes, to send messages to multiple receivers. My main use case is the implementation of an event system for LENS. I don't want receivers to have to register themselves with a central broker process or something similar, which eliminates possibilities such as a directory full of FIFOs, where a message is written to all of them. In practice, that exact solution is probably the best way to implement this, but I'd like to see what the alternative looks like. Since LENS runs atop Linux (the kernel), I don't need to worry about other Unix systems and will be focusing on the facilities provided by Linux.
The solution
The critical discovery that makes it possible to implement this is a Linux system call named tee. The tee system call performs a (logical) copy from one pipe to another, but doesn't remove the data from the first pipe. This means that multiple readers can get the same data from the same pipe. Each reader creates a second, private pipe, which it fills with data from the broadcast pipe using tee. It can then read the data from its private pipe at its convenience.
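As an illustration, here's a minimal, untested sketch of a single tee-based read in C. It assumes stdin is the read end of the broadcast pipe (so you could run it as, say, echo hello | ./demo), and abbreviates error handling:

```c
/* demo.c - duplicate data out of a pipe without consuming it.
 * Build with: cc -o demo demo.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* For this demo, stdin stands in for the read end of the
     * broadcast pipe; tee(2) requires both descriptors to be pipes. */
    int broadcast_rd = STDIN_FILENO;
    int priv[2]; /* this reader's private pipe */

    if (pipe(priv) < 0) {
        perror("pipe");
        return 1;
    }

    /* Copy up to 64 KiB into the private pipe. The data is NOT
     * removed from the broadcast pipe, so other readers can tee the
     * same bytes. */
    ssize_t teed = tee(broadcast_rd, priv[1], 65536, 0);
    if (teed < 0) {
        perror("tee");
        return 1;
    }
    if (teed == 0)
        return 0; /* nothing in the pipe and no writers left */

    /* The private copy can be consumed whenever convenient. */
    char buf[65536];
    ssize_t got = read(priv[0], buf, sizeof buf);
    printf("teed %zd bytes, read back %zd bytes\n", teed, got);
    return 0;
}
```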
Unfortunately, there is a problem: If readers don't remove data from the broadcast pipe, the data is never removed at all. As a result, it gradually fills up with data. A pipe has a maximum capacity, which, when reached, will prevent more data from being written. Also, all of the historical data in the pipe remains there forever and will be read by readers who join later, which might be undesirable if the data expires. To solve this, somebody needs to read the data from the pipe, and the sensible participant to have this responsibility is the writer, since it put the data there in the first place.
While I wanted to be able to have a completely unaware writer, which doesn't even need to know it's doing anything special, this isn't feasible, at least with the tee approach. The writer will need to have some special logic. Additionally, because we don't want a race condition between readers trying to get the data and the writer trying to clear out the pipe, there needs to be some kind of locking in the picture.
The sub-problem
The lock we need is one that allows only one writer at a time (because we only need one at a time, and it simplifies the implementation), but allows many readers. The writer needs to clear out the data from the pipe only after all readers have done their tee calls; otherwise, some readers will miss it. The lock will control erasing the data, not the actual write. But readers may try to take the lock again quickly after they release it, so bad timing could lead to the writer never getting it. This would result in the readers all getting duplicate data until the writer eventually won the race to take the lock. So the lock needs to ensure a writer always wins the race.
To summarize, we need a reader-writer lock that ensures that if a writer is waiting, readers will never take the lock before it gets it. There isn't a built-in lock in Linux with these semantics, so we need to build it ourselves.
The sub-solution
Fortunately, Linux has a dedicated feature for building your own locks. A futex is a synchronization primitive provided by the kernel, consisting of a 32-bit futex word, access to which is synchronized using the futex system call. This system call can do one of a few things, depending on flags. For our purposes, there are two operations:
- Atomically check that the futex word has an expected value (and return if not), then block until the futex is signaled to indicate a change in its value. The check ensures that the value isn't changed by another process/thread between when the user-space locking logic decides to wait and when the futex call runs, which could result in waiting on an available lock or similar problems.
- Signal that the futex word has changed value, waking up a specified number of waiters.
Futexes are generally used in a loop, where you repeatedly check if the lock is in a state that allows you to take it, calling futex to wait if not, and then eventually using atomic CPU instructions like compare-and-exchange to update the lock to indicate that you have it. If someone else touches the lock while you're trying to take it, you restart the loop.
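Since glibc provides no wrapper for futex, the sketches later in this post assume two small helpers along these lines (the names futex_wait and futex_wake are mine):

```c
/* Minimal futex helpers; futex(2) has no glibc wrapper, so we go
 * through syscall(2). The non-PRIVATE operations work across
 * processes, which we need because the futex word will live in
 * shared memory. */
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep until someone wakes us, but only if *word still holds
 * `expected` when the kernel checks (otherwise return immediately). */
static long futex_wait(uint32_t *word, uint32_t expected)
{
    return syscall(SYS_futex, word, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to `count` waiters blocked on *word. */
static long futex_wake(uint32_t *word, int count)
{
    return syscall(SYS_futex, word, FUTEX_WAKE, count, NULL, NULL, 0);
}
```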
We'll implement our reader-writer lock using a futex. The futex word will be UINT32_MAX if a writer holds the lock (because nobody else can take it while a writer has it); otherwise, the low 31 bits will indicate the number of readers holding it, while the uppermost bit will indicate that at least one writer is waiting to take the lock (and therefore that no new readers should take it). Zero indicates no writer waiting, and no readers holding the lock, meaning the next comer gets the lock immediately.
A reader will follow these steps (a code sketch follows the list):
- Wait for the high bit to be unset. If a writer holds the lock, UINT32_MAX has a set high bit, so this covers both cases; if a writer is merely waiting, this avoids starving it.
- Atomically increment the futex word (a compare-and-exchange, so it fails if the word changed underneath us). If the lock is unheld (zero), this takes the lock for readers. If it's held already, it increases the number of readers holding it. If the increment fails, start over.
- When done reading, atomically decrement the futex word. If no readers remain after the decrement (the word is zero, or only the high bit is left by a waiting writer), wake all waiters; waking only on exact zero would leave a waiting writer asleep forever.
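In C, building on the futex helpers above and C11 atomics, the reader side of the lock might look like this untested sketch (the constant names are mine):

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdint.h>

#define WRITER_WAITING 0x80000000u /* high bit: a writer wants the lock */
#define WRITER_HELD UINT32_MAX     /* all ones: a writer holds the lock */

/* `w` points at the futex word, which lives in shared memory. */
static void read_lock(_Atomic uint32_t *w)
{
    for (;;) {
        uint32_t v = atomic_load(w);
        /* Step 1: back off while a writer holds or wants the lock;
         * both states have the high bit set (WRITER_HELD is all ones). */
        if (v & WRITER_WAITING) {
            futex_wait((uint32_t *)w, v);
            continue;
        }
        /* Step 2: become one more reader. The compare-and-exchange
         * fails if the word changed under us; start over if so. */
        if (atomic_compare_exchange_weak(w, &v, v + 1))
            return;
    }
}

static void read_unlock(_Atomic uint32_t *w)
{
    /* Step 3: one reader fewer. If none remain (the word is zero, or
     * only the waiting bit is left), wake everyone, including any
     * waiting writer. */
    uint32_t now = atomic_fetch_sub(w, 1) - 1;
    if ((now & ~WRITER_WAITING) == 0)
        futex_wake((uint32_t *)w, INT_MAX);
}
```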
A writer, meanwhile, will follow these steps (also sketched below):
- Atomically set the high bit (if it is already set, that's fine, continue).
- Wait for the futex word to be 2147483648, or 0x80000000 in hex (all zeroes except the high bit, indicating a writer (us) is waiting, and no readers or writers hold the lock). On each wake, ensure the high bit is still set.
- Atomically set the futex word to UINT32_MAX (a compare-and-exchange from 0x80000000). If the atomic set fails, start over (another writer took the lock).
- When done, set the futex word to zero (doesn't need to be atomic because the writer holds exclusive control of the lock, and the most that will happen is atomic sets of the high bit by other writers waiting, which don't change the value), then wake all waiters.
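A corresponding untested sketch of the writer side, using the same helpers and constants:

```c
/* Writer side of the lock, following the steps above. */
static void write_lock(_Atomic uint32_t *w)
{
    for (;;) {
        /* Step 1: announce a waiting writer by setting the high bit.
         * If another writer already set it, that's fine. */
        uint32_t v = atomic_fetch_or(w, WRITER_WAITING) | WRITER_WAITING;
        /* Step 2: wait until ONLY the high bit is set: no readers
         * remain and no writer has claimed the lock yet. Re-assert
         * the bit on each pass, since an unlocking writer zeroes the
         * whole word. */
        while (v != WRITER_WAITING) {
            futex_wait((uint32_t *)w, v);
            v = atomic_fetch_or(w, WRITER_WAITING) | WRITER_WAITING;
        }
        /* Step 3: claim the lock outright. A failed compare-and-
         * exchange means another writer beat us to it; start over. */
        if (atomic_compare_exchange_weak(w, &v, WRITER_HELD))
            return;
    }
}

static void write_unlock(_Atomic uint32_t *w)
{
    /* Step 4: release and wake everyone. Other writers may set the
     * high bit while we hold the lock, but on an all-ones word that
     * changes nothing, so a plain store of zero is safe. */
    atomic_store(w, 0);
    futex_wake((uint32_t *)w, INT_MAX);
}
```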
The problems with this lock are that it frequently falls victim to a thundering herd, and it will sometimes give writers two turns in a row, which for our purposes could result in readers missing a write. This can probably be mitigated to some extent by writers waiting briefly after unlocking before trying to lock again, but not if there is more than one writer.
Now that we have our lock, we can actually implement a broadcast pipe. The broadcast pipe consists of the pipe itself and a shared-memory region holding the futex (in retrospect, we could probably just put the data into the shared memory... it's always going to have to use up at least PAGESIZE of memory). A single writer can follow these steps, sketched in code after the list (multiple writers would step on each other, necessitating a second, simpler lock ensuring there is only one writer):
- Write the data, without locking.
- Immediately try to take the lock. Readers will be holding the lock, and the writer will wait until they all release it, signalling they have read the data.
- Read the data using read, removing it from the pipe. Do something with it or throw it away.
- Release the lock.
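Putting the pieces together, a sketch of the writer side; bcast_wr and bcast_rd are hypothetical descriptors for the two ends of the shared pipe, w is the futex word in shared memory, and error handling is omitted:

```c
#include <unistd.h>

/* Publish one message to all current readers. Assumes a single
 * writer, and that the message fits in the pipe and in one read(); a
 * real implementation would drain in a loop with O_NONBLOCK set on
 * bcast_rd. */
static void broadcast(_Atomic uint32_t *w, int bcast_wr, int bcast_rd,
                      const void *data, size_t len)
{
    char discard[65536];

    /* Step 1: write the data; the write itself is not locked. */
    write(bcast_wr, data, len);

    /* Step 2: wait until every reader currently holding the lock has
     * teed the data out and released. */
    write_lock(w);

    /* Step 3: drain the pipe so it doesn't fill up; here the bytes
     * are simply thrown away. */
    read(bcast_rd, discard, sizeof discard);

    /* Step 4: let the readers back in. */
    write_unlock(w);
}
```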
Meanwhile, readers follow these steps, with a sketch after the list:
- Create a private pipe to store retrieved data for later reading.
- Take the lock.
- Use tee to peek the data into the private pipe. Read it later when it's convenient.
- Release the lock, signalling that the data has been read.
Obviously, a reader can re-use the private pipe from one read to another.
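Under the same assumptions, the reader side might look like this; note that the private pipe is created once and reused:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Receive messages forever. bcast_rd is the read end of the shared
 * pipe; w is the futex word in shared memory. */
static void reader_loop(_Atomic uint32_t *w, int bcast_rd)
{
    int priv[2];
    char buf[65536];

    /* Step 1: a private pipe for this reader, created once and
     * reused for every message. */
    pipe(priv);

    for (;;) {
        /* Step 2: hold the lock so the writer can't drain the
         * broadcast pipe out from under us. */
        read_lock(w);

        /* Step 3: duplicate the pipe's contents into the private
         * pipe without consuming them; tee blocks here until there
         * is data. */
        tee(bcast_rd, priv[1], sizeof buf, 0);

        /* Step 4: releasing the lock signals the writer that this
         * reader has its copy. */
        read_unlock(w);

        /* Consume the private copy at leisure. */
        ssize_t n = read(priv[0], buf, sizeof buf);
        (void)n; /* ...handle n bytes of message here... */
    }
}
```

One consequence of blocking inside the critical section is that a reader arriving between messages sleeps in tee while holding the lock, and only lets the writer in to drain once it has its copy of the next message.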
Is this useful?
No.
Well, the lock might be. But the broadcast pipe is too complicated to be a savings over a directory full of FIFOs/sockets, a broker process, or lock-controlled shared memory. I don't plan to actually use this anywhere, but it was interesting to design.
Tagged:
- Personal
- Linux
- Programming