"My first attempt at a benchmark involved allocating and freeing blocks of random size. Twitter friends correctly scolded me and said that's not good enough. I need real data with real allocation patterns and sizes."
"The goal is to create a "journal" of memory operations. It should record malloc and free operations with their inputs and outputs. Then the journal can be replayed with different allocators to compare performance."
I don't think that is sufficient either. You need to mix the malloc/free with the other work, because the other work knocks your allocator data structures out of cache and pollutes your TLB and branch predictor, and loads up the DRAM interface (e.g. by the GPU fetching data from it). And so on.
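For concreteness, a minimal sketch of what a journal record and a cache-polluting replay loop could look like (the entry layout, the `live` table, and the scratch-buffer trick are my assumptions, not the author's actual format):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One journal entry: enough to replay the call and to match a free
 * back to the malloc that produced the block. */
typedef struct {
    uint8_t  op;        /* 0 = malloc, 1 = free           */
    uint32_t size;      /* requested size (malloc only)   */
    uint32_t block_id;  /* index of the block in question */
} journal_entry;

/* Replay the journal, but interleave unrelated work between operations so
 * the allocator's data structures get knocked out of cache the way they
 * would in the running game. */
static void replay(const journal_entry *j, size_t n,
                   void **live,                 /* live[block_id] = pointer */
                   uint8_t *scratch, size_t scratch_len)
{
    for (size_t i = 0; i < n; i++) {
        if (j[i].op == 0)
            live[j[i].block_id] = malloc(j[i].size);
        else
            free(live[j[i].block_id]);

        /* Synthetic "other work": stream through a large buffer to pollute
         * the caches and TLB between allocator calls. */
        memset(scratch, (int)(i & 0xff), scratch_len);
    }
}
```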
I would have measured the time directly inside of the Doom allocator and the OP could produce the same charts directly from the measured data without replaying the allocation log.
And IF I were going to collect that data, I'd grab the call stack as well to see where the allocations were coming from. There might be a chance to retrofit an arena allocator, but I would think Carmack's code is probably already optimal? Yes it is, for the very reason we can even have this discussion about it.
The other change I would make to the experiment is to take the traces from a timedemo rather than regular gameplay, so that hypotheses can be tested against a reproducible run. At the very least, one could precisely time the impact of the instrumentation.
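On the call-stack idea: if the journal route is kept, one low-overhead way to record where each allocation comes from on Linux is glibc's backtrace(). A minimal sketch (how the stacks get attached to the journal is omitted, and the names are mine):

```c
#include <execinfo.h>   /* backtrace, backtrace_symbols_fd (glibc) */
#include <stdio.h>

#define MAX_FRAMES 32

/* Capture the current call stack so each journal entry can also record the
 * allocation site; keep the hot path cheap and resolve symbols offline. */
static int capture_callstack(void *frames[MAX_FRAMES])
{
    return backtrace(frames, MAX_FRAMES);
}

/* Offline / debugging helper: print one captured stack to stderr. */
static void dump_callstack(void *const frames[], int n)
{
    backtrace_symbols_fd(frames, n, fileno(stderr));
}
```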
> I would have measured the time directly inside of the Doom allocator
Exactly! The author already has all the infra necessary (rdtsc for each call). Measuring time inside the game would be more accurate and simpler. Why did the author do things this way? I must be missing something.
(By the way, the author even seems to know about this issue since they added code that simulates using the allocated blocks (touching one byte for every allocated 4k), but that does not feel like it's nearly enough.)
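A rough sketch of what measuring in place could look like: wrap each allocator call with the TSC, and touch one byte per 4 KiB page the way the article's simulation does. The sample sink and the function names are illustrative, not the author's actual instrumentation:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang, x86) */

/* Crude sample sink: real instrumentation would write to a preallocated
 * ring buffer and dump it at exit. */
static uint64_t g_total_cycles;
static uint64_t g_calls;

static void record_sample(uint64_t cycles)
{
    g_total_cycles += cycles;
    g_calls++;
}

/* Time the allocator call itself, from inside the running game,
 * so no replay is needed to produce the latency charts. */
static void *timed_malloc(size_t size)
{
    uint64_t t0 = __rdtsc();
    void *p = malloc(size);
    record_sample(__rdtsc() - t0);

    /* Simulate the block actually being used: touch one byte per
     * 4 KiB page so pages are faulted in and pulled into cache. */
    if (p) {
        volatile char *c = p;
        for (size_t off = 0; off < size; off += 4096)
            c[off] = 0;
    }
    return p;
}
```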
> grab the call stack as well
Maybe this part could be done without writing any code, using VTune / Linux perf? Sure, those only gather sampled (stochastic) measurements, so they're not ideal for the original latency measurement. But to get a rough idea of where the costly allocations come from, they could be an easy way.
> Why did the author do things this way? I must be missing something.
Likely because it's easier to test against different allocators with a small replay tool than it is to try to get Doom 3 to compile and link against dlmalloc, jemalloc, mimalloc, rpmalloc, and TLSF.
I would also bet that getting the game to perform the exact same series of allocations would be an intractable problem to solve. I don't think Doom 3 has a benchmark mode; the author just recorded themselves loading the game, loading a level, doing a bit of gameplay, etc.
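A sketch of why the replay route makes swapping allocators cheap: the replayer only has to go through a pair of function pointers (the table below is hypothetical, not the author's tool):

```c
#include <stddef.h>
#include <stdlib.h>

/* The replay tool only needs these two entry points, so each candidate
 * allocator (dlmalloc, jemalloc, mimalloc, rpmalloc, TLSF, ...) can be
 * wired in at link time or through a table like this, with no need to
 * rebuild Doom 3 itself. */
typedef struct {
    void *(*alloc)(size_t size);
    void  (*release)(void *ptr);
} allocator_vtable;

/* Baseline backend: the system allocator. */
static const allocator_vtable system_allocator = { malloc, free };
```

Each allocator then just needs a thin adapter exposing malloc/free-shaped entry points.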
Carmack is a great empiricist, meaning he pays a lot of attention to how things work in a running system. All of his engines provide rich facilities for record and replay as well as running benchmarks.
There are lots of facilities for recording and playing back demos. The author could record their own gameplay and play it back in real time, versus running a timedemo (which plays back as fast as possible).
id Tech games have a "timedemo" mode, where they replay a pre-recorded demo as fast as possible and report how many FPS the engine was able to process. These were very popular benchmarks long ago.
> I would have measured the time directly inside of the Doom allocator
Yeah, that'd be much better. It'd still only tell us how the allocator performed when it had to share cache, memory bandwidth, etc. with Doom 3, though. Performance of similar allocations in a different application could be different; even changing the screen resolution of Doom 3 might be enough to change the performance of the allocator. To get anywhere we'd need to understand all the hardware components that alter performance, the limits of their various resources, and the penalties for running out of those resources.

I normally find it easier to remove the allocations from my program's main loop than to do this kind of analysis. In fact, I think the analysis is impossible: what happens when some background process like a virus checker decides to run? Doing less work in your program is almost always better, and moving the allocations out of the main loop normally makes the ownership easier to understand too, which helps to minimise code complexity.
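As a contrived illustration of moving an allocation out of the main loop (not Doom 3 code, just the shape of the change):

```c
#include <stdlib.h>
#include <string.h>

/* Before: an allocation inside the main loop, paid every iteration. */
void run_naive(int frames, size_t work_bytes)
{
    for (int i = 0; i < frames; i++) {
        char *scratch = malloc(work_bytes);
        if (!scratch)
            return;
        /* ... per-frame work using scratch ... */
        memset(scratch, 0, work_bytes);
        free(scratch);
    }
}

/* After: allocate once up front, reuse it every frame. Besides removing
 * allocator latency from the loop, ownership of the buffer is obvious. */
void run(int frames, size_t work_bytes)
{
    char *scratch = malloc(work_bytes);
    if (!scratch)
        return;
    for (int i = 0; i < frames; i++) {
        /* ... per-frame work using scratch ... */
        memset(scratch, 0, work_bytes);
    }
    free(scratch);
}
```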
I suspect you’d find Carmack’s code was “good enough that there was something more important to work on” rather than necessarily “optimal”. So it’s entirely possible there’s still some fruit to find, just none of it low-hanging.
Eleven million mallocs over seven minutes of gameplay at 60 fps comes out at an average of 437 mallocs per frame (this includes "load level" and "die and reload", so the average during gameplay is likely even lower).
The author writes "Doom 3 hits 60fps in debug and calls malloc all over the place", but to me those stats suggest malloc-averse code.
Back in the 1980s I co-wrote a computer algebra system that ran, for example, on many colleagues' first computer: the original Macintosh. Resources were scarce; everything mattered.
Malloc attempts to work acceptably well across diverse scenarios. Specific programs such as ours can adopt simpler memory disciplines, supported by a thin interface to malloc.
Almost all of our memory needs could be served by pools of fixed-size slugs of memory. Each pool experienced considerable churn and reuse, but the total memory needed by each pool was unimodal, growing to a peak then subsiding until the pool was no longer needed. We chose a universal fixed-size block to be served by malloc. We kept these blocks on a master stack, to assign to pools as needed. We kept the slugs in each pool on stacks, to reuse as needed. Closing a pool returned its blocks to the master stack.
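A minimal sketch of that protocol in C, assuming a universal 4 KiB block and one fixed slug size per pool (names, sizes, and the lack of alignment handling are simplifications, not the original code):

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096   /* the universal fixed-size block served by malloc */

/* Master stack of blocks: blocks come from malloc once and are recycled
 * between pools via an intrusive singly linked stack. */
typedef struct block { struct block *next; } block;
static block *master_stack = NULL;

static void *block_acquire(void)
{
    if (master_stack) {
        block *b = master_stack;
        master_stack = b->next;
        return b;
    }
    return malloc(BLOCK_SIZE);
}

static void block_release(void *p)      /* return a block to the master stack */
{
    block *b = p;
    b->next = master_stack;
    master_stack = b;
}

/* A pool serves fixed-size "slugs" carved out of blocks, with a free stack
 * for reuse. Closing the pool returns all of its blocks to the master stack. */
typedef struct slug { struct slug *next; } slug;

typedef struct {
    size_t  slug_size;    /* >= sizeof(slug); set when the pool is opened */
    slug   *free_slugs;   /* slugs returned to the pool, ready for reuse  */
    block  *blocks;       /* blocks currently owned by this pool          */
    char   *bump;         /* next unused byte in the current block        */
    size_t  remaining;    /* unused bytes left in the current block       */
} pool;

static void *pool_alloc(pool *p)
{
    if (p->free_slugs) {                  /* fast path: reuse a freed slug */
        slug *s = p->free_slugs;
        p->free_slugs = s->next;
        return s;
    }
    if (p->remaining < p->slug_size) {    /* need another block */
        block *b = block_acquire();
        if (!b)
            return NULL;
        b->next = p->blocks;
        p->blocks = b;
        p->bump = (char *)b + sizeof(block);
        p->remaining = BLOCK_SIZE - sizeof(block);
    }
    void *s = p->bump;
    p->bump += p->slug_size;
    p->remaining -= p->slug_size;
    return s;
}

static void pool_free(pool *p, void *s)   /* push the slug back for reuse */
{
    ((slug *)s)->next = p->free_slugs;
    p->free_slugs = (slug *)s;
}

static void pool_close(pool *p)           /* recycle every block */
{
    while (p->blocks) {
        block *b = p->blocks;
        p->blocks = b->next;
        block_release(b);
    }
    p->free_slugs = NULL;
    p->remaining = 0;
}
```

A pool is opened by zero-initializing the struct and setting slug_size; every hot-path operation is a push or pop on a singly linked stack, which is where the speed came from.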
There's nothing novel here; this is about as dumb a protocol as one can imagine, and correspondingly faster than anything else we tried.
I still believe the lesson today: One's core algorithms need to be crafted to cooperate with the fastest memory protocols. The doctor can't do everything without the patient's cooperation.
For example, I believe that the approach taken by Rust will ascend as we better learn how to work with it.
You mean Zig, right? Zig is allocator-agnostic. It's hard (but not impossible) to implement type-specific allocators in Rust, and if you want the same type to take context-dependent allocator types, or context-dependent allocator instances, it's much, much harder, if I'm not mistaken. And the default way of programming Rust doesn't really tolerate allocation failures (which you might run up against with fixed memory slabs).
Game dev here (with a lot of experience in engines with significant idTech ancestry): the goal is, generally, to avoid mallocs _during_ a frame; you alloc everything you need at startup. I'm surprised there's that much ongoing allocation.
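A sketch of the usual shape of that approach: one big buffer sized at startup, bump-allocated during the frame, reset at the end of the frame (the sizes and names are made up, not idTech's actual scheme):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One big buffer allocated at startup; per-frame allocations are a
 * pointer bump and the whole thing is thrown away at end of frame. */
typedef struct {
    uint8_t *base;
    size_t   capacity;
    size_t   used;
} frame_arena;

static int arena_init(frame_arena *a, size_t capacity)   /* at startup */
{
    a->base = malloc(capacity);
    a->capacity = capacity;
    a->used = 0;
    return a->base != NULL;
}

static void *arena_alloc(frame_arena *a, size_t size)    /* during the frame */
{
    size = (size + 15) & ~(size_t)15;          /* 16-byte alignment */
    if (a->used + size > a->capacity)
        return NULL;                           /* frame budget blown */
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

static void arena_reset(frame_arena *a)        /* at end of each frame */
{
    a->used = 0;
}
```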
Not sure what this empirical analysis adds to answer the initial question. Sure, modern computers are fast and modern operating systems are designed to reduce latency.
At the same time, the “best practices” of avoiding mutexes and malloc on real-time threads are there for a reason: they can trigger a system call, which adds considerable latency (as the measurements show). Because real-time is all about deterministic latency, that is undesirable.
I did not mean to discourage people from making their own evaluations. Maybe I did not word it correctly.
The article starts with questioning best practices for implementing real-time processing threads. The statistical analysis presented in the article is based on a single workload and a single machine, which in my opinion does not help to answer this question.
I agree that the empirical technique could be better, but the experiment is neat even if slightly flawed.
If the author wanted to show how much memory allocation can be done in an audio thread, they should construct an audio workload where they can tune the allocation and memory-access patterns, and then determine exactly how much is too much. But really, that was only the starting motivation; as they admit at the start, they got nerd-sniped and started measuring DOOM memory allocations instead (also fun). We should set the original motivation aside; otherwise we'd be using a completely different experiment to justify something that was never tested.
The fact that all the allocators have the same tail performance suggests the measurement is polluted by something like thread preemption or processor power state transitions.
"The goal is to create a "journal" of memory operations. It should record malloc and free operations with their inputs and outputs. Then the journal can be replayed with different allocators to compare performance."
I don't think that is sufficient either. You need to mix the malloc/free with the other work, because the other work knocks your allocator data structures out of cache and pollutes your TLB and branch predictor. And loads up the dram interface (eg by the gpu fetching data from it). Etc etc.
reply