Erik McClure

The Windows malloc() Implementation Is A Trash Fire


I am currently porting an experimental language to Windows. This experimental language was built in C++ with LLVM, and relies heavily on GCC extensions like VLAs and Compound Statement Expressions, which basically made it impossible to build with MSVC (although I have a truly horrifying idea I may attempt later). Luckily, you can now build things on Windows with Clang, which solves a lot of problems. However, clang-cl simply compiles the code - it still uses the Microsoft C++ headers and links to the Microsoft C++ runtime. This is a good thing, because it ensures maximum compatibility with win32 APIs and other Windows executables.

Unfortunately, it also means you get the Windows malloc() implementation from MSVCRT (specifically, it statically links with the CRT shipped with Visual Studio), which is quite possibly one of the worst piles of rotten garbage ever compiled in the history of C. I learned how to program, like many people did, by doing indie game dev. Like many people, I never released a single game, but I did write a bunch of code which now lies forgotten in lost GitHub repositories. I was taught that to allocate memory was to summon death itself to ruin your performance - a single call to malloc() during a frame could render your game unplayable. Any allocation that needed to happen with any regularity required a custom, purpose-built allocator, usually either a fixed-size block allocator backed by a freelist, or a greedy allocator that got freed all at once after the level ended. You could squeeze out even more performance by putting those allocators in thread-local storage, so each thread got its own and never wasted time on synchronization.
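
For flavor, here's a minimal sketch of the kind of fixed-size freelist allocator I'm talking about, with a thread-local instance so it never needs a lock. The names and sizes are made up for illustration, not from any actual game:

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal fixed-size block allocator: carves one slab into equal-sized blocks
// and threads a freelist through the unused ones, so alloc/free are just
// pointer swaps - no locking, no syscalls.
template <size_t BlockSize, size_t BlockCount>
class FixedBlockAllocator {
  static_assert(BlockSize >= sizeof(void*), "block must hold a freelist pointer");

public:
  FixedBlockAllocator() {
    slab_ = static_cast<char*>(std::malloc(BlockSize * BlockCount));
    if (!slab_) return;
    // Link every block into the freelist.
    for (size_t i = 0; i < BlockCount; ++i) {
      void* block = slab_ + i * BlockSize;
      *static_cast<void**>(block) = free_;
      free_ = block;
    }
  }
  ~FixedBlockAllocator() { std::free(slab_); }

  void* allocate() {
    if (!free_) return nullptr;           // out of blocks; a real one would grow
    void* block = free_;
    free_ = *static_cast<void**>(block);  // pop the head of the freelist
    return block;
  }

  void deallocate(void* block) {
    *static_cast<void**>(block) = free_;  // push back onto the freelist
    free_ = block;
  }

private:
  char* slab_ = nullptr;
  void* free_ = nullptr;
};

// One allocator per thread, so no synchronization is ever needed.
thread_local FixedBlockAllocator<64, 4096> g_frame_allocator;
```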

It turns out you don't actually need any of this on Linux. You can basically malloc() whatever you want (within reason) and it'll be surprisingly fast.

LLVM was built for Linux - or rather, it was built to work on Mac OS X, which is POSIX-compliant and looks like a Linux system if you squint. Most of its optimizations are designed to make it go fast on Mac or Linux. Since it's a compiler, it makes a lot of tiny allocations, because it basically represents control flow as a gigantic digraph. I actually thought it used a custom allocator for these tiny nodes, because that's what I would've done, but in reality it just calls new everywhere and lets the Linux malloc() implementation deal with it. The reason I care is that the experimental language I'm working on needs to JIT its core library when booting up - it takes about 1.1 seconds to do this on Linux, and 7 seconds on Windows.

At first, I thought this inefficiency came from the experimental language using std::unordered_map everywhere, since it allocates a new chunk of memory for every single item to guarantee iterator stability and is well known for being incredibly inefficient compared to basically any other hash map implementation. I substituted it with Google's Abseil flat_hash_map and got a nearly 2x speedup on Windows, dropping the startup time from 7 seconds to 4. Pretty good, and in line with what I expected.
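
The swap itself is close to a drop-in replacement. A rough sketch of what it looks like, with made-up map contents for illustration:

```cpp
#include <string>
#include "absl/container/flat_hash_map.h"

// Before: node-based map, one heap allocation per element.
// std::unordered_map<std::string, int> symbols;

// After: open-addressed flat map, elements stored inline in the table's
// backing array, so insertions don't hammer the allocator.
absl::flat_hash_map<std::string, int> symbols;

void touch() {
  symbols["x"] = 1;                       // same basic insert/lookup interface
  if (auto it = symbols.find("x"); it != symbols.end())
    it->second += 1;
}
```

The trade-off is that flat_hash_map gives up the pointer and iterator stability that std::unordered_map pays all those per-node allocations for, so anything holding references into the map has to be adjusted. Anyway - can you guess what the corresponding Linux speedup was?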

Nothing.

Literally. Fucking. NOTHING.

We tested this both on my WSL instance on Windows and on a native NixOS desktop, with identical results. Deciding to forge ahead, I found another inefficient area of the compiler that was needlessly allocating an entire vector to support bignum types, even though none of the tests ever actually used them, so the vector only ever held a single integer. I changed it to skip the vector entirely when there was only one element. This cut another second off the Windows time. On Linux it appeared to either do nothing or somehow make things slightly slower. Worse, according to my profiler, I was running out of stuff outside of LLVM to optimize - the remaining time was mostly LLVM.
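
That bignum change was essentially a small-size optimization: keep the common single-word case inline and only touch the heap when a value genuinely needs more than one word. A hypothetical sketch of the idea, not the compiler's actual representation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bignum storage: the common case (one 64-bit word) lives inline
// and costs no allocation; only genuinely large values ever touch the heap.
struct BigInt {
  uint64_t small = 0;             // valid when words is empty
  std::vector<uint64_t> words;    // only populated for multi-word values

  void set(uint64_t v) {
    words.clear();
    small = v;                    // no allocation on this path
  }

  void promote() {                // called only when a value outgrows one word
    if (words.empty())
      words.push_back(small);
  }
};
```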

Okay, fuck it, this is clearly an allocator issue, and incidentally, some game developers had started using LLVM for… some reason, and of course, most games work on Windows, so they needed LLVM to be fast on Windows. They introduced a very janky way of replacing malloc() in LLVM, which I kinda had to hack into a custom vcpkg fork to make it work with my dependency handling, but it seemed to work!

Except it made LLVM crash.

It turns out that if you replace malloc() with either rpmalloc or mimalloc, the Windows kernel will sometimes trigger a debug break instruction inside std::recursive_mutex, but only if you are JITing code. This happens even if you disable threading in LLVM. At first I thought it was some kind of ABI mismatch (since that has happened before), but no amount of tweaking things to match up fixed the issue. I spoke with a personal friend of mine who happens to work at Apple on LLVM, and they suspect this is a kernel sanity check intended to catch potential deadlocks, maybe because the critical section isn't re-entrant and something screwed up. However, this was inside a std::recursive_mutex implementation, so… it should be re-entrant, by definition. There may simply be a race condition or implementation error somewhere inside LLVM, but I'm really not in the mood to debug extremely arcane multithreading issues in LLVM.

So, instead, I did the worst hack of my entire life by replacing the efficient std::recursive_mutex implementation that used critical sections with an extremely inefficient re-entrant spinlock.
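
For the morbidly curious, a re-entrant spinlock of this sort is only a couple dozen lines. This is a sketch of the general idea, not the exact code I shoved into LLVM:

```cpp
#include <atomic>
#include <thread>

// Re-entrant spinlock: remembers which thread owns the lock plus a recursion
// depth, so the owning thread can lock it again without deadlocking.
class RecursiveSpinLock {
public:
  void lock() {
    const std::thread::id self = std::this_thread::get_id();
    if (owner_.load(std::memory_order_acquire) == self) {
      ++depth_;                          // already ours: just bump the depth
      return;
    }
    std::thread::id expected{};          // default id means "no owner"
    while (!owner_.compare_exchange_weak(expected, self,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
      expected = std::thread::id{};
      std::this_thread::yield();         // spin until ownership frees up
    }
    depth_ = 1;
  }

  void unlock() {
    if (--depth_ == 0)
      owner_.store(std::thread::id{}, std::memory_order_release);
  }

private:
  std::atomic<std::thread::id> owner_{};
  unsigned depth_ = 0;                   // only touched by the owning thread
};
```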

This actually worked, and instantly brought the startup time on Windows down to approximately 1.2 seconds, now within the realm of the Linux startup time. The fucking Windows allocator was the source of all of my performance problems the entire time. aganea, who has my undying gratitude for taking over the task of unfucking the catastrophic trainwreck that was LLD, happens to also be the one who introduced the method of patching LLVM's allocator. Sadly, it seems whatever they were using LLVM for did not involve JITing code, or may have used a different method, as nobody seems to have run into this problem except me.

The best part about this whole fiasco is that mimalloc was developed by Microsoft, for Microsoft services, because of how slow Microsoft's own MSVCRT malloc() was. So this entire time I've been trying to replace Microsoft's malloc() with Microsoft's other malloc() because Microsoft's malloc() was too slow for Microsoft. For some insane reason, mimalloc is not shipped with Visual Studio, not even as something you have to opt in to (in case there's some backwards-compatibility concern). Microsoft's instructions just tell you to provide operator new overloads yourself instead of actually integrating it with their own compiler, despite mimalloc being nearly an order of magnitude faster in rapid small-allocation scenarios!
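
Those overloads amount to a handful of lines you drop into one translation unit (mimalloc also ships a header that does roughly this for you); a hedged sketch, not their exact code:

```cpp
#include <cstddef>
#include <new>
#include <mimalloc.h>

// Route the global C++ allocation operators through mimalloc. Defining these
// in exactly one translation unit replaces new/delete program-wide; the sized
// and aligned variants are omitted here for brevity.
void* operator new(std::size_t size) {
  void* p = mi_malloc(size);
  if (!p) throw std::bad_alloc();
  return p;
}

void* operator new[](std::size_t size) {
  void* p = mi_malloc(size);
  if (!p) throw std::bad_alloc();
  return p;
}

void operator delete(void* p) noexcept { mi_free(p); }
void operator delete[](void* p) noexcept { mi_free(p); }
```

Note that this only redirects C++ new and delete; plain malloc() calls in C code and inside the CRT still go through the old allocator unless you use mimalloc's separate override mechanism.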

So, at this point, we have learned three things: the Windows allocator is a complete trash fire; Microsoft built a replacement for their own allocator but refuses to ship it even as an opt-in alternative; and there's clearly some kind of race condition in LLVM's JIT in certain edge cases related to the usage of mutexes on Windows. If someone wants to try to reproduce this and file a proper bug on LLVM, go ahead, but it would take days to distill a minimal example and I honestly have better things to do right now. The inefficient spinlock doesn't matter when threading is disabled because there's no contention, and enabling threading doesn't appear to make my use case any faster anyway, for whatever reason.

Everything is broken and I am too tired to do anything but contribute another obscene hack on top of this pile of cards we have built modern civilization on.

