The GitHub icon on the site directs to the author’s own page, and I couldn’t find any repository for the site, which makes me wonder why they even put the GitHub link there. Just for a follow?
Yeah, I was going to say it’s awesome work, but I looked around trying to find the repo... and nothing was there. What’s the point of mentioning it runs on CF when you don’t provide the repo? This is just another SaaS.
Switching to a non-default allocator does not always bring a performance boost. It really depends on your workload, which requires profiling and benchmarking. But C/C++/Rust and other lower-level languages should all at least be able to choose from these allocators. One caveat is binary size: a custom allocator does add bytes to the executable.
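In Rust, for example, choosing the global allocator is a one-attribute change. Here's a minimal sketch using only the standard library; the shim just counts allocations while delegating to `System`, whereas a real swap would plug in a jemalloc or mimalloc crate in its place:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// A trivial allocator that delegates to the system allocator while
// counting allocations. A real swap would substitute a jemalloc or
// mimalloc wrapper type here instead of this shim.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

// This one attribute swaps the allocator for the whole program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let v: Vec<u64> = (0..1024).collect();
    assert_eq!(v.len(), 1024);
    assert!(ALLOCATIONS.load(Ordering::Relaxed) > 0);
    println!("allocations so far: {}", ALLOCATIONS.load(Ordering::Relaxed));
}
```

Any code the shim adds ends up in the binary, which is the size caveat above: a full allocator implementation is much larger than this wrapper.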
I don’t know why people still look to jemalloc. Mimalloc outperforms the standard allocator on nearly every single benchmark. Glibc’s allocator and jemalloc are both long in the tooth and don’t actually perform as well as state-of-the-art allocators. I wish Rust would switch to mimalloc or the latest tcmalloc (not the one in gperftools).
> I wish Rust would switch to mimalloc or the latest tcmalloc (not the one in gperftools).
That's nonsensical. Rust uses the system allocators for reliability, compatibility, binary bloat, maintenance burden, ..., not because they're good (they were not when Rust switched away from jemalloc, and they aren't now).
If you want the Rust compiler to link against mimalloc rather than jemalloc, feel free to test it out and open an issue, but maybe take a gander at the previous attempt: https://github.com/rust-lang/rust/pull/103944 which died for the exact same reason that the one before it (https://github.com/rust-lang/rust/pull/92249) did: an unacceptable regression of max-rss.
I know it’s easy to change but the arguments for using glibc’s allocator are less clear to me:
1. Reliability - how is an alternate allocator less reliable? Seems like a FUD-based argument. Unless by reliability you mean performance in which case yes - jemalloc isn’t reliably faster than standard allocators, but mimalloc is.
2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator? You don’t even have to do it on all systems if you want. Glibc is just unequivocally bad.
3. Binary bloat - This one is maybe an OK argument although I don’t know what size difference we’re talking about for mimalloc. Also, most people aren’t writing hello world applications so the default should probably be for a good allocator. I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.
4. Maintenance burden - I don’t really buy this argument. In both cases you’re relying on a 3rd party to maintain the code.
Also it's not "glibc's allocator", it's the system allocator. If you're unhappy with glibc's, get that replaced.
> 1. Reliability - how is an alternate allocator less reliable?
Jemalloc had to be disabled on various platforms and architectures, there is no reason to think mimalloc or tcmalloc are any different.
The system allocator, while shit, is always there and functional, the project does not have to curate its availability across platforms.
> 2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator?
It makes interactions with anything which does use the system allocator worse, and almost certainly fails to interact correctly with some of the more specialised system facilities (e.g. malloc.conf) or tooling (in rust, jemalloc as shipped did not work with valgrind).
> Also, most people aren’t writing hello world applications
Most people aren't writing applications bound on allocation throughput either
> so the default should probably be for a good allocator.
Probably not, no.
> I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.
That makes no sense whatsoever. The libc is the system's and dynamically linked. And changing allocator does not magically unlink it.
> 4. Maintenance burden - I don’t really buy this argument.
It doesn't matter that you don't buy it. Having to ship, resync, debug, and curate (cf (1)) an allocator is a maintenance burden. With a system allocator, all the project does is ensure it calls the system allocators correctly, the rest is out of its purview.
The reason the reliability & compatibility arguments don’t make sense to me is that jemalloc is still in use for rustc (again - not sure why they haven’t switched to mimalloc) which has all the same platform requirements as the standard library. There’s also no reason an alternate allocator can’t be used on Linux specifically because glibc’s allocator is just bad full stop.
> It makes interactions with anything which does use the system allocator worse
That’s a really niche argument. Most people are not doing any of that and malloc.conf is only for people who are tuning the glibc allocator which is a silly thing to do when mimalloc will outperform whatever tuning you do (yes - glibc really is that bad).
> or tooling (in rust, jemalloc as shipped did not work with valgrind)
That’s a fair argument, but it’s not an unsolvable one.
> Most people aren’t writing applications bound on allocation throughput either
You’d be surprised at how big an impact the allocator can make even when you don’t think you’re bound on allocations. There’s also all sorts of other things beyond allocation throughput & glibc sucks at all of them (e.g. freeing memory, behavior in multithreaded programs, fragmentation etc etc).
> The libc is the system’s and dynamically linked. And changing allocator does not magically unlink it
I meant that the dependency on libc at all in the standard library bloats the size of a statically linked executable.
Ah - my mistake; I somehow misread your comment. Pity about the RSS regression.
Personally I have plenty of RAM and I'd happily use more in exchange for a faster compile. It's much cheaper to buy more RAM than a faster CPU, but I certainly understand the choice.
With compilers I sometimes wonder if it wouldn't be better to just switch to an arena allocator for the whole compilation job. But it wouldn't surprise me if LLVM allocates way more memory than you'd expect.
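For the curious, the arena idea is just a bump pointer over one big buffer, with everything freed at once at the end of the job. Here's a toy sketch of that (the struct and sizes are made up for illustration, not how any real compiler does it):

```rust
// A minimal bump/arena allocator sketch: one big buffer, allocation is
// a pointer bump, and everything is freed at once when the arena is
// dropped -- which is why it can be attractive for a batch job like a
// compilation unit.
struct Arena {
    buf: Vec<u8>,
    used: usize,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Arena { buf: vec![0u8; cap], used: 0 }
    }

    // Hand out an aligned slice; returns None when the arena is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        debug_assert!(align.is_power_of_two());
        let start = (self.used + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(&mut self.buf[start..end])
    }
}

fn main() {
    let mut arena = Arena::with_capacity(1 << 16);
    {
        let a = arena.alloc(100, 8).unwrap();
        a[0] = 42;
    }
    let b = arena.alloc(4, 8).unwrap();
    b[0] = 7;
    // 100 rounded up to the next 8-byte boundary is 104, plus 4 bytes.
    assert_eq!(arena.used, 108);
    println!("arena used: {} bytes", arena.used);
}
```

The trade-off is exactly the max-rss concern from the linked PRs: nothing is ever returned mid-job, so peak memory only grows.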
Not to mention that by using the system allocator you get all sorts of things “for free” that the system developers provide for you, wrt observability and standard tooling. This is especially true if the OS and the allocator are shipped by one group rather than being developed independently.
The root cause is AMD's bad support for rep movsb (which is a hardware problem). However, Python's allocator by default adds a small offset to the addresses being read, while lower-level languages (Rust and C) do not, which is why Python seems to perform better than C/Rust. It "accidentally" avoided the hardware problem.
That extra 0x20 (32 byte) offset is the size of the PyBytes object header for anyone wondering; 64 bits each for type object pointer, reference count, base pointer and item count.
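That arithmetic can be sanity-checked by mirroring the four 64-bit fields as a C-style struct. Note this only reproduces the layout the comment describes, not CPython's actual `PyBytesObject` definition:

```rust
use std::mem;

// Mirror of the four 64-bit header fields described above (type object
// pointer, reference count, base pointer, item count). This is an
// illustrative layout, not CPython's real PyBytesObject definition.
#[repr(C)]
struct BytesHeaderSketch {
    type_ptr: u64,
    refcount: u64,
    base_ptr: u64,
    item_count: u64,
}

fn main() {
    // 4 fields * 8 bytes = 32 = 0x20, the offset seen in the benchmark.
    assert_eq!(mem::size_of::<BytesHeaderSketch>(), 0x20);
    println!("header size: {:#x} bytes", mem::size_of::<BytesHeaderSketch>());
}
```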
Thank you, because I was wondering if some Python developer found the same issue and decided to just implement the offset. It makes much more sense that it just happens to work out that way in Python.
It's obviously not Python vs C -- the time difference turns out to be in kernel code (a system call), not user code at all, and the post explicitly constructs a C program that doesn't have the slowdown by adding a memory offset. It just turns up by default in a comparison of Python vs C code because Python reads have a memory offset by default (for completely unrelated reasons) and analogous C reads don't. In principle you could also construct Python code that does see this slowdown; it would just be much less likely to show up at random. So the Python vs C comparison is a total red herring here; it just happened to be what the author noticed and used as a hook to understand the problem.
C is a very wide target. There are plenty of things that one can do “in C” that no human would ever write. For instance, the C code generated by languages like Nim and Zig, which essentially use C as a sort of IR.
By that line of thought, you could also argue that the slow path in the C and Rust versions is itself implemented in C, since glibc's memcpy is. Hence, Python being faster than Rust would also mean, in this case, that Python is faster than C.
The point is not that one language is faster than another. The point is that the default way to implement something in a language ended up being surprisingly faster when compared to other languages in this specific scenario due to a performance issue in the hardware.
In other words: on this specific hardware, the default way to do this in Python is faster than the default way to do this in C and Rust. That can be true, as Python does not use C in the default way; it adds an offset! You can change your implementation in any of those languages to make it faster, in this case by just adding an offset, so it doesn't mean that "Python is faster than C or Rust in general".
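A sketch of the two access patterns in Rust (the 1 MiB length and 0x10 offset are arbitrary illustration choices; the timing gap would only appear on affected Zen CPUs, so this just shows the aligned vs. offset copies themselves):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    const LEN: usize = 1 << 20; // 1 MiB payload
    const OFFSET: usize = 0x10; // the "Python-style" shift

    // Request a page-aligned source buffer, mimicking an allocator
    // that hands back page-aligned memory.
    let layout = Layout::from_size_align(LEN + OFFSET, 0x1000).unwrap();
    unsafe {
        let src = alloc_zeroed(layout);
        assert!(!src.is_null());
        assert_eq!(src as usize % 0x1000, 0); // page-aligned source

        let mut dst = vec![0u8; LEN];

        // Pattern 1: source address is page-aligned (the pattern that
        // hits the quirk on affected hardware).
        std::ptr::copy_nonoverlapping(src, dst.as_mut_ptr(), LEN);

        // Pattern 2: source address is page + 0x10 (the pattern that
        // avoids it).
        std::ptr::copy_nonoverlapping(src.add(OFFSET), dst.as_mut_ptr(), LEN);

        assert!(dst.iter().all(|&b| b == 0)); // source was zeroed
        dealloc(src, layout);
    }
    println!("both copies completed");
}
```

To actually observe the difference you would wrap each copy in a timing loop and run it on one of the affected CPUs.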
I recall when Pentium was introduced we were told to avoid rep and write a carefully tuned loop ourselves. To go really fast one could use the FPU to do the loads and stores.
Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.
Seems most of these things need to be benchmarked on the CPU, as they change "all the time". I've sped up plenty of code by just replacing hand-crafted assembly with high-level functionally equivalent code.
Of course so-slow-it's-bad is different; however, a runtime-determined implementation choice would avoid that as well.
I'm not sure it makes sense to pin this only on AMD.
Whenever you're writing performance-critical software, you need to consider the relevant combinations of hardware + software + workload + configuration.
Sometimes a problem can be created or fixed by adjusting any one / some subset of those details.
If that's a bug that only happens with AMD CPUs, I think that's totally fair.
If we start adding in exceptions at the top of the software stack for individual failures of specific CPUs/vendors, that seems like a strong regression from where we are today in terms of ergonomics of writing performance-critical software. We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration (even if you can narrow down the "relevant" ones).
> We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration
That is kind of exactly what you would do when optimising for popular platforms.
If this error occurs on an AMD CPU used by half your users, is your response to your users going to be "just buy a different CPU", or are you going to fix it in code and ship a "performance improvement on XYZ platform" update?
Nobody said "just buy a different CPU" anywhere in this discussion or the article. And they are pinning the root cause on AMD which is completely fair because they are the source of the issue.
Given that the fix is within the memory allocator, there is already a relatively trivial fix for users who really need it (recompile with jemalloc as the global memory allocator).
For everyone else, it's probably better to wait until AMD reports back with an analysis from their side and either recommends an "official" mitigation or pushes out a microcode update.
The fix is that AMD needs to develop, test and deploy a microcode update for their affected CPUs, and then the problem is truly fixed for everyone, not just the people who have detected the issue and tried to mitigate it.
Yeah, but even if you'd take this on as your responsibility (while it should really be the CPU vendor fixing it), you would like to resolve it much lower in the stack, like the Rust compiler/standard library or LLVM, and not individually in any Rust library that happens to stumble upon that problem.
Well, if Excel were running at half the speed (or half the speed of LibreOffice Calc!) on half of the machines around here, somebody at Redmond would notice, find the hardware bug, and work around it.
I guess that in most big companies it suffices that there is a problem with their own software running on the laptop of a C-level manager or of somebody close to one. When I was working for a mobile operator, the antennas the network division cared about most were the ones close to the home of the CEO. If he could make his test calls with no problems, they had time to fix the problems of the rest of the network across the whole country.
That's completely fine in kernels and low-level libraries, but if I find that in a library as high-level as opendal, I'll definitely mark it down as a code smell.
It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't. That is odd. Another question never answered: why did pyo3 add so much overhead? It was over half the difference between the two.
> It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't.
The root cause is not about page alignment. In fact, all of the allocators return aligned memory.
The root cause is that AMD CPUs don't implement FSRM correctly when copying data from addresses in the range 0x1000 * n to 0x1000 * n + 0x10.
> Another question never answered: why did pyo3 add so much overhead? It was over half the difference between the two.
OpenDAL's Python binding v0.42 does have many places to improve; for example, we could allocate the buffer in advance or use `read_buf` into an uninitialized vec. I skipped this part since it's not the root cause.
> It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't. That is odd.
Other way around: with glibc it was page-aligned; with the others, it wasn't.
This weird Zen performance quirk aside, I'd prefer page alignment so that an allocation like this which is a nice multiple of the page size doesn't waste anything (RAM or TLB), with the memory allocator's own bookkeeping in a separate block. Pretty surprising to me that the other allocators do something else.
Disclaimer: The title has been changed to "Rust std fs slower than Python!? No, it's hardware!" to avoid clickbait. However, I'm not able to fix the title on HN.
AMD's implementation of the `rep movsb` instruction is surprisingly slow when addresses are page-aligned. Python's allocator happens to add a 16-byte offset that avoids the hardware quirk/bug.
FSRM is a CPU feature embedded in the microcode (in this instance, amd-ucode) that software such as glibc cannot interact with. I refer to it as hardware because I consider microcode a part of the hardware.
Immutable doesn’t mean it cannot be updated. It just means that updates happen without touching the existing copy. This CoW-ish style is what they’re selling.
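The idea can be illustrated with Rust's `Arc::make_mut`, which clones shared data before mutating it. This is just a toy sketch of the copy-on-write notion, not whatever the product actually implements:

```rust
use std::sync::Arc;

fn main() {
    // "Immutable" data updated CoW-style: the original copy is never
    // touched; an update clones the data first when it is shared.
    let original: Arc<Vec<i32>> = Arc::new(vec![1, 2, 3]);
    let mut updated = Arc::clone(&original);

    // make_mut sees the value is shared and clones it before mutating.
    Arc::make_mut(&mut updated).push(4);

    assert_eq!(*original, vec![1, 2, 3]); // untouched
    assert_eq!(*updated, vec![1, 2, 3, 4]); // the new copy
    println!("original: {:?}, updated: {:?}", original, updated);
}
```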