Hacker News
A lock-free, concurrent, generic queue in 32 bits (nullprogram.com)
162 points by signa11 2 days ago | 89 comments

"32 bits" here turns out to mean that the head and tail indices are packed into the same 32-bit integer and the data is stored elsewhere. That's enough to put head and tail in the same cache line, so it's a disaster for performance on roughly any platform.
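
In rough code, the packing being described looks something like this (a sketch with helper names of my own, not the article's exact layout):

```c
#include <stdint.h>

/* Sketch of the packing described above: a single 32-bit value holds
   both indices, head in the high half and tail in the low half. The
   queue keeps this value in one _Atomic uint32_t, so a single atomic
   load observes both indices together -- and both land in one cache
   line. */
static inline uint32_t q_pack(uint16_t head, uint16_t tail)
{
    return ((uint32_t)head << 16) | tail;
}

static inline uint16_t q_head(uint32_t v) { return (uint16_t)(v >> 16); }
static inline uint16_t q_tail(uint32_t v) { return (uint16_t)(v & 0xffff); }
```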

The mpmc variant looks non-serialising (threads push multiple things before another manages to push one) so it's not a queue, and it has a cas loop in it so lock-free is a bit of a stretch.

Note that the second variant does not claim to be MPMC, only SPMC.

Many lock-free structures use CAS loops. It would not be lock-free if a consumer pre-empted at the wrong time could prevent other consumers from making progress, but I don't think that's the case here.

> it has a cas loop in it so lock-free is a bit of a stretch.

To be fair, if that's your standard, there are virtually no lock-free algorithms. The multi-consumer has a wait-free retry loop. That's pretty much as good as it gets.

And the SPSC case is lock-free under even the strictest definitions, though of course that's not particularly difficult to achieve.
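
For reference, the generic shape of such a CAS retry loop in C11 (a minimal sketch, not the article's code): a failed compare-exchange means some other thread's operation succeeded, so the system as a whole always makes progress, which is exactly the lock-free (though not wait-free) guarantee.

```c
#include <stdatomic.h>

static _Atomic int counter;

/* A failed CAS here implies another thread's CAS succeeded, so
   system-wide progress is guaranteed even though this one thread may
   retry indefinitely: lock-free, but not wait-free. */
static void add(int n)
{
    int old = atomic_load_explicit(&counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(&counter, &old, old + n))
        ;  /* on failure, `old` is refreshed with the current value */
}
```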

That's reasonable. I'm twitchy about CAS loops because I'm used to unfair schedulers, but they're a great choice in Linux/x64 land.

Why is having head and tail in the same cache line bad for performance? Given that the code reads both head and tail to check if the queue is empty, being in the same cache line seems optimal.

https://rigtorp.se/ringbuffer/, from a comment below, answered my question.

In rough terms, every push or pop from the queue also causes head or tail to be written, thus dirtying a line of cache. That’s the bit I misunderstood.

To be clear, as mentioned in another thread [0] there is no need to ensure consistency between head and tail for a SPSC so storing them in different cache lines would be much better.

For the multiple-consumer variant, however, consistency between head and tail is required so packing them in the same atomic is the simplest solution.

[0] https://news.ycombinator.com/item?id=31386741
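
A sketch of what that separate-cache-line SPSC layout could look like (the field names and the 64-byte line size are my assumptions):

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SPSC layout with the indices on separate cache lines,
   as suggested above, so the producer's writes to `tail` never dirty
   the consumer's `head` line. 64 bytes is an assumed line size. */
struct spsc {
    alignas(64) _Atomic uint32_t head; /* advanced only by the consumer */
    alignas(64) _Atomic uint32_t tail; /* advanced only by the producer */
    alignas(64) int buf[256];
};
```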

Head and tail not being the same abstraction is the central theme for why alternatives like the LMAX Disruptor are so fast, even in MP/MC scenarios.

This queue seems to have issues with multi-producer operation. Each producer busy polls until the previous producer finishes. What happens if the previous producer gets de-scheduled while you are busy polling? You keep occupying a core that the other thread could use to do useful work.

When I transitioned from trading to tech, I realized that HFT data structures have limited usefulness when you are not also doing HFT. You can do a lot more things lock-free when you don't have to worry about adverse scheduling.

By the way, there is no easy way to fix this with an MPMC ring buffer. You have to accept some chance of adverse behavior. There may be something complicated you can do, but it's easier to just split into multiple rings.

> What happens if the previous producer gets de-scheduled while you are busy polling? You keep occupying a core that the other thread could use to do useful work.

This is why newer versions of LMAX offer additional wait strategies beyond BusyWait. The one I prefer to use, unless latency is absolutely insanely critical, is the YieldingWait strategy.

If YieldingWait does what I think it does, this does not work nearly as effectively as you might think. I am assuming that YieldingWait just calls "sched_yield" once in a while, assuming that the kernel will interpret that correctly. That is a bad assumption: the kernel gets sched_yield calls from a lot of places, and it means different things to different callers.

If YieldingWait does a proper yield, by calling futex, well then it's not really lock free any more.

This is cool but it seems like it occupies an unusual place in tradeoff space. Most of the time SPSC queues would benefit from having the start and end pointers in separate pages, or even having cached copies of the start and end pointers in even more separate pages to avoid atomics[0].

[0]: https://rigtorp.se/ringbuffer/

But then those accesses have to be protected by a lock in order to enforce invariants between "start" and "end". The whole point of representing both with a single atomic value is that this makes it trivial to preserve those required invariants.

Because the queue grows only in one direction and it is consumed in the other direction, reading a stale value from the other side is still safe, so two-word atomicity is not needed.

In fact the other side's value is normally cached in a local memory location, so as to amortize the cost of the roundtrips over multiple reads and writes.
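
A sketch of that caching trick on the producer side (names are mine; the rigtorp post linked elsewhere in the thread applies it to both ends):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP 256  /* power of two */

static _Atomic uint32_t head, tail;  /* shared indices */
static int buf[CAP];
static uint32_t cached_head;         /* producer-local, possibly stale */

/* The producer re-reads the shared `head` only when its stale copy
   says the queue is full; a stale value can only make the queue look
   fuller than it is, so this stays safe while amortizing cache-line
   traffic over many pushes. */
static bool try_push(int x)
{
    uint32_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    if (t - cached_head == CAP) {  /* full according to stale copy? */
        cached_head = atomic_load_explicit(&head, memory_order_acquire);
        if (t - cached_head == CAP)
            return false;          /* genuinely full */
    }
    buf[t % CAP] = x;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}
```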

Only for SPSC. You need H/T consistency in other cases for full/empty checks.

Cmon AMD/Intel, give us hardware queues already :)

separate cache lines not pages :)

You are correct, of course

Why is the _Atomic qualifier on the actual data array here required? Shouldn't it be part of the atomic qualifiers on the index read/write operations to make sure those are not reordered and the memory is actually valid? Given array elements could be > 64bit in size I guess it won't be to make the access actually atomic? Is it another hint to the compiler not to reorder things? Not super familiar with those C qualifiers, and therefore curious.

And is the acquire ordering on push/pop and release on commit the right thing?

> Why is the _Atomic qualifier on the actual data array here required?

Also, my C skills are lacking but I recall that C's support for atomic data structures is a higher-level abstraction that could very well use locks under the hood.

C's support for atomics even provides an atomic_is_lock_free function, which can be used to check whether an atomic object is lock-free or not.


Using ints with C's atomics does not ensure the atomic data type is lock-free. C even provides a macro to verify whether the primitive data type has lock-free access or not.
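
Concretely, the check looks like this (a minimal sketch; the answer is target-dependent, though on mainstream 64-bit targets a 32-bit atomic is lock-free):

```c
#include <stdatomic.h>
#include <stdint.h>

/* The runtime check mentioned above. There are also compile-time
   macros, e.g. ATOMIC_INT_LOCK_FREE == 2 means "always lock-free"
   on this target. The result here depends on the platform. */
static int packed_indices_lock_free(void)
{
    _Atomic uint32_t packed = 0;
    return atomic_is_lock_free(&packed);
}
```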


Using C's atomic in this context to claim lock-free status sounds a lot like hand-waving over the problem domain to pretend locks aren't there. It's a cool read, but the underlying claim doesn't really hold.

I've heard rumors that on modern processors, "lock free" doesn't even make a ton of sense. It's actually contention that is costly, when there is no contention atomic data structures are virtually free as I understand it.

Don't know if there is data to back that up, though.

Modern processors use something like https://en.m.wikipedia.org/wiki/MOESI_protocol to share data between processors.

So basically, writing a shared line is expensive. Atomic reads are free, as are atomic writes to a cache line that no one else is using.

It's never truly free but on some CPUs (e.g. x86 and x86-64) you're always paying for it anyway and so you might as well use it.

You can think of a lock-free data structure as a data structure with an embedded wait-free lock implemented with atomics. This has several advantages:

- Less memory - a pthread mutex is around 40 bytes, compared to 4 bytes for this queue.

- No waiting - in some cases you don't want to wait, just do a try-push and if you can't, do some other work with the thread and then try again. Eventually you will have to either wait or discard the data, depending on your situation.

- No syscalls - mutexes weren't always as light as they are now, older implementations would always do a syscall and context switch to take or release a mutex.
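
The 40-byte figure is glibc/x86-64 specific, but it's easy to check on your own platform (a sketch; other libcs and architectures differ):

```c
#include <pthread.h>
#include <stddef.h>

/* Compare the footprint of a pthread mutex with the 4 bytes of the
   packed queue discussed above. sizeof(pthread_mutex_t) is 40 on
   glibc/x86-64; other platforms will report different values. */
static size_t mutex_size(void) { return sizeof(pthread_mutex_t); }
```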

> Less memory - a pthread mutex is around 40 bytes, compared to 4 bytes for this queue.

This sort of thing is pretty ugly, and is one place where "portable" C and C++ really betrays the value of the language. Chances are many (often all) your uses are Linux, and so a futex is all you actually needed, which costs 4 bytes like this structure. But to be portable you can't ask for a futex, and it's deliberately not made easy to try to directly call into Linux system calls for futex since you'll probably get it wrong and then blame everybody else.

So you ask for pthread_mutex and now it's ten times bigger and you're regretting your life choices.

> So you ask for pthread_mutex and now it's ten times bigger and you're regretting your life choices.

It's 36 bytes bigger.

Unless you're for some reason doing embedded development with a resource constrained system running a UNIX-like OS, that is not a concern that registers anywhere.

And you get standard code that runs anywhere.

Arguing about 36 bytes per critical section makes no sense. You waste far more than that by handling a string.

It is very much a concern, as you can't as easily fit your mutex-protected data on the same cache line as the mutex.

> wait-free lock

That's an oxymoron.

Lock/wait freedom is not about performance but about progress guarantees. Often lock free data structures are not faster than locked ones. Reducing contention applies to both locked and lock-free structures.

There’s a paragraph in the article about this.

“The single producer and single consumer didn’t require locks nor atomic accesses to the storage array since the queue guaranteed that accesses at the specified index were not concurrent. However, this is not the case with multiple-consumers. Consumers race when popping. The loser’s access might occur after the winner’s commit, making its access concurrent with the producer. Both producer and consumers must account for this.”

Yes, I read that one, but still didn't understand what will change in code generation due to it. If it's about the read on the consumer side not being reordered after the commit, then I assumed the release on the atomic will take care of it. Maybe I'll throw it into godbolt later on to figure out.

The data array / buffer doesn't need to be atomic. Once the atomic load/store on indices has justified access to the data you're good.

C puts atomic on the type, not the operation. That's conservative and occasionally very annoying (can't use normal stores in some parts of the program and atomic in others, thanks to aliasing rules) but doesn't hurt this particular application.

Why do so many people use the head==tail as full condition?

You can use the entire buffer and not waste one slot by doing:

push: buf[writeidx++ % bufsize] = x;

pop: x = buf[readidx++ % bufsize];

With unsigned integers you get the size by doing: writeidx - readidx. Then you check size==0 or size==bufsize to detect empty/full.
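
Put together, a single-threaded sketch of that scheme (my code, not the article's; bufsize must be a power of two):

```c
#include <stdbool.h>
#include <stdint.h>

/* Free-running unsigned indices: size = writeidx - readidx, so every
   slot is usable and head==tail unambiguously means empty. BUFSIZE
   must be a power of two so index wraparound stays consistent with
   the modulo. */
enum { BUFSIZE = 8 };
static int buf[BUFSIZE];
static uint32_t writeidx, readidx;

static bool push(int x)
{
    if (writeidx - readidx == BUFSIZE)
        return false;                   /* full */
    buf[writeidx++ % BUFSIZE] = x;
    return true;
}

static bool pop(int *x)
{
    if (writeidx == readidx)
        return false;                   /* empty */
    *x = buf[readidx++ % BUFSIZE];
    return true;
}
```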

Fabian Giesen's always excellent blog has a post about various ring buffer implementations: https://fgiesen.wordpress.com/2010/12/14/ring-buffers-and-qu...

He has a companion post about a "magic" ring buffer that uses a virtual memory mapping to make a ring buffer where you can always address the entire sizes as contiguous memory, but I think it has a bit of minor bit rot in the linked implementations. https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff...

This method is fine, but with it you must always take care to choose bufsize as a power of two, otherwise the overhead of having to do integer divisions for all queue operations may be noticeable.

> but with it you must always take care to choose bufsize as a power of two, otherwise the overhead of having to do integer divisions

Keep in mind that one of the reasons behind the arbitrary "power of two" requirement is that the method proposed by the author abuses a single atomic integer to hold two shorts representing the indices for the head and tail elements.

I don't think it makes much sense to talk about overhead given this method requires doing a bunch of bitmasks to operate over the high and low short.

The overhead due to bit masks and rotations/shifts is negligible on almost all CPUs.

Larger CPUs can do many such operations during every clock cycle and even in very small CPUs they are usually single-cycle operations.

Depending on the CPU, integer divisions are 10 to 100 times slower, so in a completely different range of overhead.

This breaks when your indices wrap around the integer limit. I'm not saying this can't be worked around but you end up with something much more complicated than just head==tail. It's a memory/runtime trade-off that I think is worth wasting one slot for.

No, nothing breaks with unsigned integers.

Modular arithmetic makes it work. Try a few examples close to UINT_MAX if you want to convince yourself.
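
A concrete example near the wraparound (a sketch): with 32-bit unsigned indices, writeidx has wrapped past zero while readidx hasn't, and the subtraction still comes out right because it is modulo 2^32.

```c
#include <stdint.h>

/* Unsigned subtraction is defined modulo 2^32, so the element count
   is correct even after writeidx overflows past UINT32_MAX. */
static uint32_t q_size(uint32_t writeidx, uint32_t readidx)
{
    return writeidx - readidx;
}
```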

Right, you resolve the ambiguity of the wraparound because we know that logically the read_idx (which I guess is better called pop_idx) can never exceed the write_idx. It does fail if bufsize equals 2^w where w is the width of your index type, but eh, that's fair enough. You still 'waste' a slot but it's in the address space rather than physically.

It does imply that UINT_MAX needs to be divisible by “bufsize” if I’m not mistaken, right? So that the modulo boundaries align with the overflow.

Not necessarily divisible, but requires power of 2 bufsize with this code (but you probably want that anyway to avoid division).

With a non-power-of-2 bufsize you need some extra logic for the writeidx/readidx increments, but the rest of the code can stay the same.

That doesn't seem right...

If bufsize is a power of 2, doing the modulo would be exactly the same as just using a smaller integer type (like 8 bits on a microcontroller). What we want is the integer wraparound to never coincide with the buffer position wrapping. For that we would need a size that has some other factor than 2, and thus a "real" modulo operation in the indexing (which is slow).

Or is there an error in my thinking?

edit: Duh, I understand now. If the counter has at least twice the range as the buffer size, the subtraction will always give the correct number of elements used. Power of 2 or not doesn't matter.

Not having to do the subtraction could still give a performance benefit, and if the buffer elements aren't very large, one wasted slot is worth the tradeoff.

Like for any algorithm choice, the optimal choice depends on the context.

On a desktop/laptop computer, wasting a queue slot matters very little, but it may be useful to achieve the maximum speed, so this method does not seem preferable.

On the other hand, many microcontroller applications have no read/write RAM other than a few kilobytes internal to the MCU. So RAM usage may be the most important constraint, and there may be many FIFO queues for the various peripheral interfaces.

For such MCU applications, it is valuable to be aware of this alternative FIFO queue implementation, as it may be the best choice.

It took me a bit to understand how TCP seqnums work in case of wraparound, but the trick is exactly that: everything works fine as long as you have less than 2^31 unack'd bytes.

Ring buffers work exactly the same way

Easier to sanity check if you use uint8 for the indices

This creates an ambiguity for testing the full and empty conditions. You end up having to store an additional atomic flag to track that state.

The multiple-consumer version looks extremely similar, if not identical, to the Go [1] and Tokio [2] local work-stealing queues.

These do a bit more though, as they also allow batch stealing operations.

[1] https://github.com/golang/go/blob/master/src/runtime/proc.go...

[2] https://tokio.rs/blog/2019-10-scheduler#a-better-run-queue

I wonder if there’s a way to integrate futexes in there to let consumers and producers park. Otherwise whenever the rate of production or consumption is meaningfully different (which it is in the real world), there’s a lot of CPU being wasted…

A lock free queue still helps to have, even if you use an explicit semaphore or futex to synchronize threads at a higher level. A producer can queue up multiple items per post, and a consumer can empty the queue between pends. Without the lock free queue, the queue itself would have to be protected by a mutex. It's actually nice when lock free data structure implementations leave OS level synchronization up to the caller. It makes the implementation more generally useful and portable. It's perhaps a less obvious example of "dependency injection".

I can't speak to the use case of TFA, but in some contexts where similar data structures are prominently used, we park by spin waiting (which may include looking at all our other queues) because we don't want to pay a bunch of microseconds of latency for actually being descheduled and rescheduled again.

Maybe TFA is for embedded use?

I don't see how spin waiting constitutes parking? In what sense is the execution context "parked" if in fact it's just spinning?

Not the GP, but I read "we use spin waiting" as a response to "aren't you wasting a bunch of CPU cycles" in the sense that "yes, but it's the least bad option".

Yes! You can use (futex based) event counts to add blocking behaviour to any lock free queue.

An SPSC circular queue can be implemented with just load and stores and acquire/release consistency (which is very cheap on x86), so using a single variable for both read and write pointers seems quite suboptimal.

Using two distinct 16-bit _Atomic variables would have been an improvement, but I guess having both pointers in the same cache line is going to be the bottleneck anyway.

> It can store up to 32,767 elements at a time — more than enough for message queues, which must always be bounded.

Imagine I have a Gui where the user presses a button, upon which a message is sent to a queue, upon which a consumer reads it and performs some long-running task. Why should that queue be bounded?

In an interactive application, you probably don't want to allow the user to queue up 30000 actions that will then slowly get executed over the next, say, 10 minutes. That would mean there is an extremely large delay between taking a UI action and seeing the result of that action (because the other 29999 actions have to be handled before the latest one can be handled). That's also why some applications display a progress spinner and block you from interacting with the UI further (of course that's not always ideal either, you might want to have some small queue and optimistically update the UI already).

If they are just batch background tasks (i.e. there is no expectation of seeing some immediate UI update as a response), then having a larger queue might be fine. There's not really that much reason to put a (small) limit on, say, the number of files to be downloaded in a queue. And having a really large queue bound is not that dissimilar from having an arbitrarily large queue (limited by memory, of course), in terms of buffer bloat and added latency.

> If they are just batch background tasks (i.e. there is no expectation of seeing some immediate UI update as a response), then having a larger queue might be fine.

"Larger" is not good enough; you really want an unbounded queue (unbounded in theory of course), because otherwise the GUI thread might block at some point. Unless you want to spend a lot of effort on exception handling (at which point unbounded is still the best choice, see below).

Also, unnecessary restrictions in your program might cause problems in the future when your software is ported to a larger architecture, or when demands change.

The point is there should always be appropriate backpressure. For a GUI, lockup would be inappropriate backpressure. More appropriate would be some interactive backpressure -- even a modal spinning wheel is better than queuing additional rage-clicks driven by user frustration at slowness, that are all wasted.

There are no truly unbounded queues. At some point your application will start thrashing swap or trigger the OOM killer. Backpressure is always the right solution (but of course sizing queues is hard).

Why would it be preferable for it to be unbounded?

Bounds let you have a `try_push` function that can obviously fail, whereas with unbounded you can only really know it failed because of OOM, or your producer thread gets blocked.

When `try_push` failure happens you know something strange is happening in your code.

Unbounded can also have a try_push() functionality.

Unbounded is preferable because that's how we do everything else, and best prepares for porting to larger architectures. Do your strings have a maximum size, for example?

We have bounded queues because consumers and producers can consume at different rates. An unbounded queue will have the producer fill up main memory and then you need to do load shedding which turned the whole unbounded story into a lie.

In what circumstances would a ‘try_push’ fail on an unbounded queue? On a bounded queue it would fail when the queue is full.

The size of the queue can be read from a config file, and is tuned to the capability of the machine.

> In what circumstances would a ‘try_push’ fail on an unbounded queue?

When you have an out-of-memory situation.

The thing you are trying to express is that you want a queue that stops working only when it physically can't work. Putting arbitrary large numbers in config files is not the way to express it.

This doesn't seem correct. What if, for example, the long-running tasks would like to allocate memory, and they cannot because the user has clicked the button a few billion times? In that case, it seems like perhaps we should have accepted fewer jobs, so that we would be able to perform the ones we accepted.

If the only thing you need to allocate memory for is your queue, then maybe. Otherwise you'd run into problems like the sibling comment describes. That's the very same reason you limit connection pools in applications that connect to a database: to prevent one instance from using all the database server's connections and leaving nothing for the other instances.

So then your program gets killed. Using a bounded queue lets you handle these situations gracefully, and then you can tie these errors to alerts, which are in turn tied to a scaling policy.

Does your device have infinite memory?

The C code examples are odd:

for (;;) … while (condition…) …

Why not use while in both situations, so while(1)… instead of for(;;) ?

The semantics of the for loop with an omitted condition more accurately conveys the intent. "Loop forever", vs. "Loop while true is true". A clever compiler will optimize it to the same machine code, but conceptually the while loop is more complex. It requires understanding what a tautology is, for example.

I’m used to seeing for(;;) in C code, but I disagree that it’s conceptually simpler - the opposite, actually. The for() syntax is already inconsistent, with its three expressions being interpreted as statement;guard-expression;statement.

There’s no other occasion I can think of where an expression becomes vacuously true by being omitted. If it were conceptually coherent you’d write for(;true;). I guess maybe the reason they made C accept for(;;) but not while() is because while() might confuse a compiler by looking like a function call.

You're arguing syntax, I'm arguing semantics. When the intent is to have the program loop unconditionally, omitting a condition seems more appropriate than providing a tautological condition.

Syntax wise, it doesn't really matter. "for" just happens to be the keyword that allows omitting the condition. "While" effectively means "during which". "For" on the other hand has a significantly more flexible meaning; it seems reasonable to me that the for keyword would be more flexible. Syntax wise, the most explicit thing to have in a language would be a "forever" keyword of some sort.

Even a really stupid compiler is going to map for(;;) and while(1) to the same thing. An unconditional back branch at the end of the block. Anything else would be more work.

> Why not use while in both situations, so while(1)… instead of for(;;) ?

Because some compilers warn that the condition is constant and suggest you rewrite it as for (;;)

Both are common in idiomatic C.

for (;;) is less typing.

Someone made a similar comment on my code recently, is that a generational thing?

for (;;) has always been the idiomatic way to do an infinite loop.

I started programming in C circa 1997, following a brother who started a few years earlier, and always used while(1).

FWIW https://github.com/robohack/ucb-csrg-bsd contains both variants.

Many many decades ago, non-optimising compilers would literally check 1==1 on each iteration of the latter.

Not a problem now.

And why would you need to write while(1==1) and not just while(1)?

You don’t - but the compiler needs to test if 1 is true (so conceptually 1==1) every iteration if it isn’t an optimising compiler.

They lack the code to remove a tautological test.

But approximately all compilers these days are optimising compilers, so don’t worry about it.

Yes, that's what I was questioning - the, to me, nonsensical 1==1. Why would you not go for the normal form of the tautological statement? And if you don't go for the normal form, why choose 1==1 and not one of the infinitely many other forms?

> Why would you not go for the normal form of the tautological statement

Because normalising is an optimization, and these compilers didn't have any optimisations. That was normal back then. You had a whole separate product that was specifically an 'optimising compiler'. These days that's implicit and we just say 'compiler'.

These days, even -O0 will canonicalise the condition! That's how normalised it is! But I guess this is done during translation to IR, rather than as an 'optimisation'.


The question is not one of optimization but of readability. Can you honestly say you would let `while (1 + 3 * 5 < 10 + 8 + 5 * 1000)` stand in a code review for a simple infinite loop because "the compiler will optimize it anyway"?

You’re confused - nobody is suggesting writing anything but ‘while (1)’ at the source code level. We’re talking about how the compiler implements that, and how internally it needs to do ‘1 == 1’ for each iteration if it doesn’t optimise at all.

We’re talking compiler implementation and optimisation - not coding aesthetics.

That makes little sense. There is no equality check in the generated ASM. A compiler doesn't synthesize 1==1 when a `while(1)` is used. Your link even shows this. There is no `cmp` opcode used. A simple unconditional `jmp` is used.

Are you trolling me?

> Your link even shows this.

Yes the link shows how it doesn't matter today! Because almost all compilers are now 'optimising compilers'!

> There is no `cmp` opcode used. A simple unconditional `jmp` is used.

Oh my god that's the entire point! That's how compilers work now. They always optimise. They didn't used to do that.

The compiler checks the truthiness of the value in the while condition. Modern compilers remove it with either a tautology check, or a shortcut during IR building. They didn't used to do that, so there used to be a cost, which is why people started cargo-culting the for equivalent.

Here's a worked example using a niche compiler with a very different architecture that unusually still doesn't canonicalise during building.





See the difference? See the 'cmp eax, 1' and then how it goes away under optimisation?

really "while (1)" is the same as "while (1 != 0)", but there's not really any difference, and as mentioned, 1==1 and 1!=0 will be optimised out to the same thing

it's no big deal, just one person's coding style, I've done the same thing for at least 40 years now

I personally prefer ‘for’ for both cases actually ;)
