The GitHub icon on the site directs to the author’s own page, and I couldn’t find any repository for the site, which makes me wonder why they even put the GitHub link there. Just for a follow?
Yeah, I was going to say it’s awesome work, but I looked around trying to find the repo... and nothing was there. What’s the point of mentioning it runs on CF when you don’t provide the repo? This is just another SaaS.
Switching to a non-default allocator does not always bring a performance boost. It really depends on your workload, which requires profiling and benchmarking. But C/C++/Rust and other lower-level languages should all at least be able to choose from these allocators. One caveat is binary size: a custom allocator does add bytes to the executable.
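In Rust, for example, choosing the global allocator is a one-attribute change. Here's a minimal sketch using only the standard library; the shim just counts allocations while delegating to `System`, whereas a real swap would plug in a jemalloc or mimalloc crate in its place:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// A trivial allocator that delegates to the system allocator while
// counting allocations. A real swap would substitute a jemalloc or
// mimalloc wrapper type here instead of this shim.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

// This one attribute swaps the allocator for the whole program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let v: Vec<u64> = (0..1024).collect();
    assert_eq!(v.len(), 1024);
    assert!(ALLOCATIONS.load(Ordering::Relaxed) > 0);
    println!("allocations so far: {}", ALLOCATIONS.load(Ordering::Relaxed));
}
```

Any code the shim adds ends up in the binary, which is the size caveat above: a full allocator implementation is much larger than this wrapper.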
I don’t know why people still look to jemalloc. Mimalloc outperforms the standard allocator on nearly every single benchmark. Glibc’s allocator and jemalloc are both long in the tooth and don’t actually perform as well as state-of-the-art allocators. I wish Rust would switch to mimalloc or the latest tcmalloc (not the one in gperftools).
> I wish Rust would switch to mimalloc or the latest tcmalloc (not the one in gperftools).
That's nonsensical. Rust uses the system allocators for reliability, compatibility, binary bloat, maintenance burden, ..., not because they're good (they were not when Rust switched away from jemalloc, and they aren't now).
If you want the Rust compiler to link against mimalloc rather than jemalloc, feel free to test it out and open an issue, but maybe take a gander at the previous attempt: https://github.com/rust-lang/rust/pull/103944 which died for the exact same reason that the one before it (https://github.com/rust-lang/rust/pull/92249) did: an unacceptable regression of max-rss.
I know it’s easy to change but the arguments for using glibc’s allocator are less clear to me:
1. Reliability - how is an alternate allocator less reliable? Seems like a FUD-based argument. Unless by reliability you mean performance in which case yes - jemalloc isn’t reliably faster than standard allocators, but mimalloc is.
2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator? You don’t even have to do it on all systems if you want. Glibc is just unequivocally bad.
3. Binary bloat - This one is maybe an OK argument although I don’t know what size difference we’re talking about for mimalloc. Also, most people aren’t writing hello world applications so the default should probably be for a good allocator. I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.
4. Maintenance burden - I don’t really buy this argument. In both cases you’re relying on a 3rd party to maintain the code.
Also it's not "glibc's allocator", it's the system allocator. If you're unhappy with glibc's, get that replaced.
> 1. Reliability - how is an alternate allocator less reliable?
Jemalloc had to be disabled on various platforms and architectures, there is no reason to think mimalloc or tcmalloc are any different.
The system allocator, while shit, is always there and functional, the project does not have to curate its availability across platforms.
> 2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator?
It makes interactions with anything which does use the system allocator worse, and almost certainly fails to interact correctly with some of the more specialised system facilities (e.g. malloc.conf) or tooling (in rust, jemalloc as shipped did not work with valgrind).
> Also, most people aren’t writing hello world applications
Most people aren't writing applications bound on allocation throughput either
> so the default should probably be for a good allocator.
Probably not, no.
> I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.
That makes no sense whatsoever. The libc is the system's and dynamically linked. And changing allocator does not magically unlink it.
> 4. Maintenance burden - I don’t really buy this argument.
It doesn't matter that you don't buy it. Having to ship, resync, debug, and curate (cf (1)) an allocator is a maintenance burden. With a system allocator, all the project does is ensure it calls the system allocators correctly, the rest is out of its purview.
The reason the reliability & compatibility arguments don’t make sense to me is that jemalloc is still in use for rustc (again - not sure why they haven’t switched to mimalloc) which has all the same platform requirements as the standard library. There’s also no reason an alternate allocator can’t be used on Linux specifically because glibc’s allocator is just bad full stop.
> It makes interactions with anything which does use the system allocator worse
That’s a really niche argument. Most people are not doing any of that and malloc.conf is only for people who are tuning the glibc allocator which is a silly thing to do when mimalloc will outperform whatever tuning you do (yes - glibc really is that bad).
> or tooling (in rust, jemalloc as shipped did not work with valgrind)
That’s a fair argument, but it’s not an unsolvable one.
> Most people aren’t writing applications bound on allocation throughput either
You’d be surprised at how big an impact the allocator can make even when you don’t think you’re bound on allocations. There’s also all sorts of other things beyond allocation throughput & glibc sucks at all of them (e.g. freeing memory, behavior in multithreaded programs, fragmentation etc etc).
> The libc is the system’s and dynamically linked. And changing allocator does not magically unlink it
I meant that the dependency on libc at all in the standard library bloats the size of a statically linked executable.
Ah - my mistake; I somehow misread your comment. Pity about the RSS regression.
Personally I have plenty of RAM and I'd happily use more in exchange for a faster compile. It's much cheaper to buy more RAM than a faster CPU, but I certainly understand the choice.
With compilers I sometimes wonder if it wouldn't be better to just switch to an arena allocator for the whole compilation job. But it wouldn't surprise me if LLVM allocates way more memory than you'd expect.
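For the curious, the arena idea is just a bump pointer over one big buffer, with everything freed at once at the end of the job. Here's a toy sketch of that (the struct and sizes are made up for illustration, not how any real compiler does it):

```rust
// A minimal bump/arena allocator sketch: one big buffer, allocation is
// a pointer bump, and everything is freed at once when the arena is
// dropped -- which is why it can be attractive for a batch job like a
// compilation unit.
struct Arena {
    buf: Vec<u8>,
    used: usize,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Arena { buf: vec![0u8; cap], used: 0 }
    }

    // Hand out an aligned slice; returns None when the arena is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        debug_assert!(align.is_power_of_two());
        let start = (self.used + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(&mut self.buf[start..end])
    }
}

fn main() {
    let mut arena = Arena::with_capacity(1 << 16);
    {
        let a = arena.alloc(100, 8).unwrap();
        a[0] = 42;
    }
    let b = arena.alloc(4, 8).unwrap();
    b[0] = 7;
    // 100 rounded up to the next 8-byte boundary is 104, plus 4 bytes.
    assert_eq!(arena.used, 108);
    println!("arena used: {} bytes", arena.used);
}
```

The trade-off is exactly the max-rss concern from the linked PRs: nothing is ever returned mid-job, so peak memory only grows.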
Not to mention that by using the system allocator you get all sorts of things “for free” that the system developers provide for you, wrt observability and standard tooling. This is especially true if the OS and the allocator are shipped by one group rather than being developed independently.
The root cause is AMD's bad support for rep movsb (which is a hardware problem). However, Python's allocator by default adds a small offset to the addresses being read, while lower-level languages (Rust and C) do not, which is why Python seems to perform better than C/Rust. It "accidentally" avoided the hardware problem.
That extra 0x20 (32 byte) offset is the size of the PyBytes object header for anyone wondering; 64 bits each for type object pointer, reference count, base pointer and item count.
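That arithmetic can be sanity-checked by mirroring the four 64-bit fields as a C-style struct. Note this only reproduces the layout the comment describes, not CPython's actual `PyBytesObject` definition:

```rust
use std::mem;

// Mirror of the four 64-bit header fields described above (type object
// pointer, reference count, base pointer, item count). This is an
// illustrative layout, not CPython's real PyBytesObject definition.
#[repr(C)]
struct BytesHeaderSketch {
    type_ptr: u64,
    refcount: u64,
    base_ptr: u64,
    item_count: u64,
}

fn main() {
    // 4 fields * 8 bytes = 32 = 0x20, the offset seen in the benchmark.
    assert_eq!(mem::size_of::<BytesHeaderSketch>(), 0x20);
    println!("header size: {:#x} bytes", mem::size_of::<BytesHeaderSketch>());
}
```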
Thank you, because I was wondering if some Python developer found the same issue and decided to just implement the offset. It makes much more sense that it just happens to work out that way in Python.
It's obviously not Python vs C -- the time difference turns out to be in kernel code (a system call), not user code at all, and the post explicitly constructs a C program that doesn't have the slowdown by adding a memory offset. It just turns up by default in a comparison of Python vs C code because Python reads have a memory offset by default (for completely unrelated reasons) and analogous C reads don't. In principle you could also construct Python code that does see this slowdown; it would just be much less likely to show up at random. So the Python vs C comparison is a total red herring here; it just happened to be what the author noticed and used as a hook to understand the problem.
C is a very wide target. There are plenty of things that one can do “in C” that no human would ever write. For instance, the C code generated by languages like Nim and Zig, which essentially use C as a sort of IR.
By that line of thought, you could also argue that the slow path in the C and Rust versions is itself implemented in C, since glibc's memcpy is. Hence, Python being faster than Rust would also mean, in this case, that Python is faster than C.
The point is not that one language is faster than another. The point is that the default way to implement something in a language ended up being surprisingly faster when compared to other languages in this specific scenario due to a performance issue in the hardware.
In other words: on this specific hardware, the default way to do this in Python is faster than the default way to do this in C and Rust. That can be true, as Python does not use C in the default way; it adds an offset! You can change your implementation in any of those languages to make it faster, in this case by just adding an offset, so it doesn't mean that "Python is faster than C or Rust in general".
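A sketch of the two access patterns in Rust (the 1 MiB length and 0x10 offset are arbitrary illustration choices; the timing gap would only appear on affected Zen CPUs, so this just shows the aligned vs. offset copies themselves):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    const LEN: usize = 1 << 20; // 1 MiB payload
    const OFFSET: usize = 0x10; // the "Python-style" shift

    // Request a page-aligned source buffer, mimicking an allocator
    // that hands back page-aligned memory.
    let layout = Layout::from_size_align(LEN + OFFSET, 0x1000).unwrap();
    unsafe {
        let src = alloc_zeroed(layout);
        assert!(!src.is_null());
        assert_eq!(src as usize % 0x1000, 0); // page-aligned source

        let mut dst = vec![0u8; LEN];

        // Pattern 1: source address is page-aligned (the pattern that
        // hits the quirk on affected hardware).
        std::ptr::copy_nonoverlapping(src, dst.as_mut_ptr(), LEN);

        // Pattern 2: source address is page + 0x10 (the pattern that
        // avoids it).
        std::ptr::copy_nonoverlapping(src.add(OFFSET), dst.as_mut_ptr(), LEN);

        assert!(dst.iter().all(|&b| b == 0)); // source was zeroed
        dealloc(src, layout);
    }
    println!("both copies completed");
}
```

To actually observe the difference you would wrap each copy in a timing loop and run it on one of the affected CPUs.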
I recall when Pentium was introduced we were told to avoid rep and write a carefully tuned loop ourselves. To go really fast one could use the FPU to do the loads and stores.
Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.
Seems most of these things need to be benchmarked on the CPU, as they change "all the time". I've sped up plenty of code by just replacing hand-crafted assembly with high-level functionally equivalent code.
Of course so-slow-it's-bad is different; however, a runtime-determined implementation choice would avoid that as well.
I'm not sure it makes sense to pin this only on AMD.
Whenever you're writing performance-critical software, you need to consider the relevant combinations of hardware + software + workload + configuration.
Sometimes a problem can be created or fixed by adjusting any one / some subset of those details.
If that's a bug that only happens with AMD CPUs, I think that's totally fair.
If we start adding in exceptions at the top of the software stack for individual failures of specific CPUs/vendors, that seems like a strong regression from where we are today in terms of ergonomics of writing performance-critical software. We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration (even if you can narrow down the "relevant" ones).
> We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration
That is kind of exactly what you would do when optimising for popular platforms.
If this error occurs on an AMD CPU used by half your users, is your response to your users going to be "just buy a different CPU", or are you going to fix it in code and ship a "performance improvement on XYZ platform" update?
Nobody said "just buy a different CPU" anywhere in this discussion or the article. And they are pinning the root cause on AMD which is completely fair because they are the source of the issue.
Given that the fix is within the memory allocator, there is already a relatively trivial fix for users who really need it (recompile with jemalloc as the global memory allocator).
For everyone else, it's probably better to wait until AMD reports back with an analysis from their side and either recommends an "official" mitigation or pushes out a microcode update.
The fix is that AMD needs to develop, test and deploy a microcode update for their affected CPUs, and then the problem is truly fixed for everyone, not just the people who have detected the issue and tried to mitigate it.
Yeah, but even if you'd take this on as your responsibility (while it should really be the CPU vendor fixing it), you would like to resolve it much lower in the stack, like the Rust compiler/standard library or LLVM, and not individually in any Rust library that happens to stumble upon that problem.
Well, if Excel were running at half the speed (or half the speed of LibreOffice Calc!) on half of the machines around here, somebody at Redmond would notice, find the hardware bug, and work around it.
I guess that in most big companies it suffices that there is a problem with their own software running on the laptop of a C-level manager or of somebody close to one. When I was working for a mobile operator, the antennas the network division cared about most were the ones close to the home of the CEO. If he could make his test calls with no problems, they had time to fix the problems of the rest of the network across the whole country.
That's completely fine in kernels and low-level libraries, but if I find that in a library as high-level as opendal, I'll definitely mark it down as a code smell.
It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't. That is odd. Another question never answered: why did pyo3 add so much overhead? It was over half the difference between the two.
> It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't.
The root cause is not about page alignment. In fact, all of the allocators return aligned memory.
The root cause is that AMD CPUs don't implement FSRM correctly when copying data from addresses in the range 0x1000 * n to 0x1000 * n + 0x10.
> Another question never answered: why did pyo3 add so much overhead? It was over half the difference between the two.
OpenDAL's Python binding v0.42 does have many places to improve; for example, we could allocate the buffer in advance or use `read_buf` into an uninitialized vec. I skipped this part since it's not the root cause.
> It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't. That is odd.
Other way around: with glibc it was page-aligned; with the others, it wasn't.
This weird Zen performance quirk aside, I'd prefer page alignment so that an allocation like this which is a nice multiple of the page size doesn't waste anything (RAM or TLB), with the memory allocator's own bookkeeping in a separate block. Pretty surprising to me that the other allocators do something else.
Disclaimer: The title has been changed to "Rust std fs slower than Python!? No, it's hardware!" to avoid clickbait. However, I'm not able to fix the title on HN.
AMD's implementation of the `rep movsb` instruction is surprisingly slow when addresses are page-aligned. Python's allocator happens to add a 16-byte offset that avoids the hardware quirk/bug.
FSRM is a CPU feature embedded in the microcode (in this instance, amd-ucode) that software such as glibc cannot interact with. I refer to it as hardware because I consider microcode a part of the hardware.
Immutable doesn’t mean it cannot be updated. It just means that updates happen without touching the existing copy. This CoW-ish style is what they’re selling.
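The idea can be illustrated with Rust's `Arc::make_mut`, which clones shared data before mutating it. This is just a toy sketch of the copy-on-write notion, not whatever the product actually implements:

```rust
use std::sync::Arc;

fn main() {
    // "Immutable" data updated CoW-style: the original copy is never
    // touched; an update clones the data first when it is shared.
    let original: Arc<Vec<i32>> = Arc::new(vec![1, 2, 3]);
    let mut updated = Arc::clone(&original);

    // make_mut sees the value is shared and clones it before mutating.
    Arc::make_mut(&mut updated).push(4);

    assert_eq!(*original, vec![1, 2, 3]); // untouched
    assert_eq!(*updated, vec![1, 2, 3, 4]); // the new copy
    println!("original: {:?}, updated: {:?}", original, updated);
}
```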