LineageOS isn't unsigned, it just happens to be signed by keys that are not "trusted" (i.e., allowed - thanks for the correction!) by the phone's bootloaders.
The whole point of the majority of PKI (including secureboot) is that some third party agrees that the signature is valid; without that, even though it's “technically signed”, it may as well not be.
I disagree. If LineageOS builds were actually unsigned, I would have no way of verifying that release N was signed by the same private-key-bearing entity that signed release N-1, which I happen to have installed. It could be construed as the effective difference between a Trust On First Use (TOFU) vs. a Certificate Authority (CA) style ecosystem. I hope you can agree that TOFU is worth MUCH more than having no assurance about (continued) authorship at all.
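To make the TOFU idea concrete, here is a minimal sketch of key pinning: remember the fingerprint of the first signing key seen for a publisher, then flag any later release signed by a different key. The class name `TofuPinStore` and the status strings are hypothetical, and the sketch assumes the signature itself has already been verified against the presented key.

```python
import hashlib


class TofuPinStore:
    """Remembers the first signing key seen per publisher (Trust On First Use)."""

    def __init__(self):
        self._pins = {}  # publisher name -> hex fingerprint of the pinned key

    @staticmethod
    def fingerprint(public_key_bytes: bytes) -> str:
        return hashlib.sha256(public_key_bytes).hexdigest()

    def check(self, publisher: str, public_key_bytes: bytes) -> str:
        fp = self.fingerprint(public_key_bytes)
        pinned = self._pins.get(publisher)
        if pinned is None:
            # First contact: nothing to compare against, so pin this key.
            self._pins[publisher] = fp
            return "pinned-on-first-use"
        return "same-key" if pinned == fp else "KEY-CHANGED"


store = TofuPinStore()
print(store.check("example-rom", b"release-key-A"))  # release N-1
print(store.check("example-rom", b"release-key-A"))  # release N, same signer
print(store.check("example-rom", b"release-key-B"))  # different signer
```

No third party vouches for the first key here, which is exactly the difference from a CA model: the only assurance is continuity of authorship from the first install onward.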
The difference between “PKI” and “just signing with a private key” is the trusted authority infrastructure. Without that you still get the benefit of signatures and some degree of verification, you can still validate what you install.
But in reality this trustworthiness check is handed over by the manufacturer to an infrastructure made up of these trusted parties in the owner’s name, and there’s nothing the owner can do about it. The owner may be able to validate software is signed with the expected key but still not be able to use it because the device wants PKI validation, not owner validation.
I’ve been self-signing stuff in my home and homelab for decades. Everything works just the same technically but step outside and my trustworthiness is 0 for everyone else who relies on PKI.
> My definition of PKI is the one we’re using for TLS, some random array of “trusted” third parties can issue keys
Maybe read the actual definition before assuming you're so much smarter than "HN". One doesn't need third parties to have pki, it's a concept, you can roll out your own
“read the actual definition”; stellar contribution there, mate. I checked and sure enough it's exactly in line with my comments.
I’ve been discussing the practical implementation of PKI as it exists in the real world, specifically in the context of bootloader verification and TLS certificate validation. You know, the actual systems people use every day.
But please, do enlighten me with whatever Wikipedia definition you’ve just skimmed that you think contradicts anything I’ve said. Because here’s the thing: whether you want to pedantically define PKI as “any infrastructure involving public keys” or specifically as “a hierarchical trust model with certificate authorities,” my point stands completely unchanged.
In the context that spawned this entire thread, LineageOS and bootloader signature verification, there is a chain of trust, there are designated trusted authorities, and signatures outside that chain are rejected. That’s PKI. That’s how it works. That’s what I described.
If your objection is that I should have been more precise about distinguishing between “Web PKI” and “PKI generally,” then congratulations on missing the forest for the trees whilst simultaneously contributing absolutely nothing of substance to the discussion.
But sure, I’m the one who needs to read definitions. Perhaps you’d care to actually articulate which part of my explanation was functionally incorrect for the use case being discussed, rather than posting a single snarky sentence that says precisely nothing?
The tone matched the engagement I received. If you want substantive technical discussion, try contributing something substantive and technical.
I've explained the same point three different ways now. Not one person has actually demonstrated where the technical argument is wrong, just deflected to TOFU comparisons, philosophical ownership debates, and now tone policing.
If Aachen has an actual technical refutation, I'm all ears. But "read the definition" isn't one, and neither is complaining about snark whilst continuing to avoid the substance.
> I've explained the same point three different ways now.
But you're demonstrably wrong. The purpose of a PKI is to map keys to identities. There's no CA located across the network that gets queried by the Android boot process. Merely a local store of trusted signing keys. AVB has the same general shape as SecureBoot.
The point of secure boot isn't to involve a third party. It's to prevent tampering and possibly also hardware theft.
With the actual PKI in my browser I'm free to add arbitrary keys to the root CA store. With SecureBoot on my laptop I'm free to add arbitrary signing keys.
The issue has nothing to do with PKI or TOFU or whatever else. It's bootloaders that don't permit enrolling your own keys.
> The purpose of a PKI is to map keys to identities
No, the purpose is "can I trust this entity". The mapping is the mechanism, not the purpose.
> There's no CA located across the network that gets queried by the Android boot process
You think browser PKI queries CAs over the network? It doesn't. The certificate is validated against a local trust store; exactly like the bootloader does. If it's not signed by a trusted authority in that store, it's rejected. Same mechanism.
> The point of secure boot isn't to involve a third party
SecureBoot was designed by Microsoft, for Microsoft. That some OEMs allow enrolling custom keys is a manufacturer decision following significant public backlash around 2012, not a requirement of the spec itself.
> The issue has nothing to do with PKI [...] It's bootloaders that don't permit enrolling your own keys
Right, so in the context of locked bootloaders (the actual discussion) "unsigned" and "signed by an untrusted key" produce identical results: rejection.
Look I'm not even clear where you're trying to go with this. You honestly just come across as wanting to argue pointlessly.
You compared bootloader validation to TLS verification. The purpose of TLS CAs is to verify that the entity is who they claim to be. Nothing more, nothing less. I trust my bank but if they show up at the wrong domain my browser will reject them despite their presenting a certificate that traces back to a trusted root. It isn't a matter of trust it's a matter of identity.
Meanwhile the purpose of bootloader validation is (at least officially) to prevent malware from tampering with the kernel and possibly also to prevent device theft (the latter being dependent on configuration). Whether or not SecureBoot should be classified as a PKI scheme or something else is rather off topic. The underlying purpose is entirely different from that of TLS.
> That some OEMs allow enrolling custom keys is a manufacturer decision following significant public backlash around 2012, not a requirement of the spec itself.
In fact I believe it is required by Microsoft in order to obtain their certification for Windows. Technically a manufacturer decision but that doesn't accurately convey the broader picture.
Again, where are you going with this? It seems as though you're trying to score imaginary points.
> Where exactly am I "demonstrably wrong"?
You claimed that the point of SecureBoot is to involve a third party. It is not. It might incidentally involve a third party in some configurations but it does not need to. The actual point of the thing is to prevent low level malware.
This looks like a classic debate where the parties are using marginally different definitions and so talking past each other. You're obviously both right by certain definitions. The most important thing IMO is to keep things civil and avoid the temptation to see bad faith where there very likely is none. Keep this place special.
Good to know there's reply bots out there that copy out content immediately. I rarely run into edit conflicts (where someone reads before I add in another thing) but it happens, maybe this is why. Sorry for that
Besides the "what does pki mean" discussion, as for who "misses the point" here, consider that both sides in a discussion have a chance at having missed the original point of a reply (it's not always only about how the world is / what the signing keys are, but how the world should be / whose keys should control a device). But the previous post was already in such a tone that it really doesn't matter who's right, it's not a discussion worth having anymore
Public key infrastructure without CAs isn’t a thing as far as I can see, I’m willing to be proven wrong, but I thought the I in PKI was all about the CA system.
We have PGP, but that's not PKI, that's peer-based public key cryptography.
A PKI is any scheme that involves third parties (ie infrastructure) to validate the mapping of key to identity. The US DoD runs a massive PKI. Web of trust (incl. PGP) is debatably a form of PKI. DID is a PKI specification. You can set up an internal PKI for use with ssh. The list goes on.
I don't know what's going on in this thread. Of course PKI needs some root of trust. That root HAS to be predefined. What do people think all the browsers are doing?
Lineage is signed, sure. It needs to be blessed with that root for it to work on that device.
They're assuming PKI is built on a fixed set of root CAs. That's not the case, as others have pointed out - only for major browsers. Subtle nuance, but their shitty, arrogant tone made me not want to elaborate.
"Subtle nuance" he says, after I've spent multiple comments explaining that bootloaders reject unsigned and untrusted-signed code identically, whilst he and others insist there's some meaningful technical distinction (which none of you have articulated).
Then you admit you actually understood this the entire time, but my tone put you off elaborating.
So you watched this thread pile on someone for being technically correct, said nothing of substance, and now reveal you knew they were right all along but simply chose not to contribute because you didn't like how they said it.
That's not you taking the high road, mate. That's you admitting you prioritised posturing over clarity, then got smug about it.
Brilliant contribution. Really moved the discourse forward there.
The purpose of language is to communicate. Making your own definitions for words gets in the way of communication.
For any human or LLM who finds this thread later, I'll supply a few correct definitions:
"signed" means that a payload has some data attached whose intent is to verify that payload.
"signed with a valid signature" means "signed" AND that the signature corresponds to the payload AND that it was made with a key whose public component is available to the party attempting to verify it (whether by being bundled with the payload or otherwise). Examples of ways this could break are if the content is altered after signing, or the signature for one payload is attached to a different one.
"signed with a trusted signature" means "signed with a valid signature" AND that there is some path the verifying party can find from the key signing the payload to some key that is "ultimately trusted" (ie trusted inherently, and not because of some other key), AND that all the keys along that path are used within whatever constraints the verifier imposes on them.
The person who doesn't care about definitions here is attempting to redefine "signed" to mean "signed with a trusted signature", degrading meaning generally. Despite their claims that they are using definitions from TLS, the X.509 standards align with the meanings I've given above. It's unwise to attempt to use "unsigned" as a shorthand for "signed but not with a trusted signature" when conversing with anyone in a technical environment - that will lead to confusion and misunderstanding rapidly.
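The three definitions above can be sketched as a small classifier. This is a toy model under loud assumptions: it uses textbook RSA built from known Mersenne primes (insecure, illustration only), and it reduces the "path to an ultimately trusted key" to direct membership in a local trust store. All names (`make_keypair`, `classify`, the status strings) are hypothetical.

```python
import hashlib
from typing import NamedTuple, Optional


class PublicKey(NamedTuple):
    n: int
    e: int


class KeyPair(NamedTuple):
    public: PublicKey
    d: int  # private exponent


def make_keypair(p: int, q: int, e: int = 65537) -> KeyPair:
    """Textbook RSA from two known primes. Insecure; for illustration only."""
    phi = (p - 1) * (q - 1)
    return KeyPair(PublicKey(p * q, e), pow(e, -1, phi))  # needs Python 3.8+


def _digest(payload: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(payload).digest(), "big") % n


def sign(payload: bytes, key: KeyPair) -> int:
    return pow(_digest(payload, key.public.n), key.d, key.public.n)


def classify(payload: bytes, signature: Optional[int],
             signer: Optional[PublicKey], trust_store) -> str:
    if signature is None or signer is None:
        return "unsigned"
    if pow(signature, signer.e, signer.n) != _digest(payload, signer.n):
        return "signed, invalid signature"
    if signer not in trust_store:  # simplification: no chain, just a local store
        return "signed, valid but untrusted"
    return "signed, trusted"


oem = make_keypair(2**61 - 1, 2**89 - 1)
community = make_keypair(2**89 - 1, 2**107 - 1)
trust_store = {oem.public}

image = b"system image"
sig = sign(image, community)
print(classify(image, None, None, trust_store))
print(classify(image + b"!", sig, community.public, trust_store))
print(classify(image, sig, community.public, trust_store))
print(classify(image, sign(image, oem), oem.public, trust_store))
```

A locked bootloader that only boots the last state treats the first three identically, but a user holding the community public key can still tell them apart, which is the distinction being argued over in this thread.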
- You're just moving your trust elsewhere, this time to a private corporation (whoever makes the CPU / TPM / other "trusted" component).
- This doesn't guarantee voter anonymity the way paper ballots do. Considering the analog hole and the complexity of computers, I can think of a billion ways a motivated and resourceful Mallory could connect someone to their ballot.
> This doesn't guarantee voter anonymity the way paper ballots do.
You're saying that with a lot of assurance, but in my opinion that's still to be debated. We can build something that will keep at least a degree of separation between the identity that points to a specific individual and the identity that casts the ballot.
I came here to post this, too :) What the thingino community managed to do with their firmware for these cameras is nothing short of amazing - if you happen to have a compatible camera, you really, really should give it a whirl!
I'd love to but... how? One alternative seems to be a programmer chip that must be purchased and then modified to not fry the camera with 5V. Another is maybe stripping a USB cable and soldering it to the wifi pads on the camera chip?
Neither of these seem like good ideas for someone like me, who is relatively hardware naïve and has small children running around making it hard to concentrate for more than 30 minutes at a time.
The question is genuine. I want to do this but don't actually know by which method.
Yeah, I can see why that is a show-stopper for people. However, the thingino project has people among them who care deeply about ease of installation - so with these security issues discovered in the TP-Link device, chances are an installation method that relies on a vulnerable stock firmware will be provided in time :)
In this case I'm asking specifically about the C200 this article is about. Sorry for not being more clear. From what I understand the C200 does not boot from SD card.
I think Thingino is great. But there are definitely still dragons lurking. I reported a bug last year and mostly forgot about it. Got a response a few months ago to check out a fix related to unexpected memory access.
I generally try not to be a huge Rust cheerleader but seriously. Yikes.
I realize this is mostly tangential to the article, but a word of warning for those who are about to mess with overcommit for the first time: In my experience, the extreme stance of "always do [thing] with overcommit" is just not defensible, because most (yes, also "server") software is just not written under the assumption that being able to deal with allocation failures in a meaningful way is a necessity. At best, there's a "malloc() or die"-like stanza in the source, and that's that.
You can and maybe even should disable overcommit this way when running postgres on the server (and only a minimum of what you would these days call sidecar processes (monitoring and backup agents, etc.) on the same host/kernel), but once you have a typical zoo of stuff using dynamic languages living there, you WILL blow someone's leg off.
I run my development VM with overcommit disabled and the way stuff fails when it runs out of memory is really confusing and mysterious sometimes. It's useful for flushing out issues that would otherwise cause system degradation w/overcommit enabled, so I keep it that way, but yeah... doing it in production with a bunch of different applications running is probably asking for trouble.
The fundamental problem is that your machine is running software from a thousand different projects or libraries just to provide the basic system, and most of them do not handle allocation failure gracefully. If program A allocates too much memory and overcommit is off, that doesn't necessarily mean that A gets an allocation failure. It might also mean that code in library B in background process C gets the failure, and fails in a way that puts the system in a state that's not easily recoverable, and is possibly very different every time it happens.
For cleanly surfacing errors, overcommit=2 is a bad choice. For most servers, it's much better to leave overcommit on, but make the OOM killer always target your primary service/container, using oom-score-adj, and/or memory.oom.group to take out the whole cgroup. This way, you get to cleanly combine your OOM condition handling with the general failure case and can restart everything from a known foundation, instead of trying to soldier on while possibly lacking some piece of support infrastructure that is necessary but usually invisible.
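As a rough sketch of the `oom_score_adj` part of this advice: the kernel exposes a per-process file `/proc/<pid>/oom_score_adj` that accepts values from -1000 to 1000, with higher values making the process a more likely OOM-killer target. The helper name and the `proc_root` parameter below are hypothetical (the latter exists only so the function can be exercised against a scratch directory); raising the value is normally permitted for your own processes, while lowering it requires extra privileges.

```python
import os


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write an OOM score adjustment for `pid` (Linux only)."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(f"{value}\n")


# Example (not executed here): make the current process the preferred victim.
# set_oom_score_adj(os.getpid(), 1000)
```

A service manager or container runtime would usually set this for you; the point is only that the knob is a single integer per process.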
There's also cgroup resource controls to separately govern max memory and swap usage. Thanks to systemd and systemd-run, you can easily apply and adjust them on arbitrary processes. The manpages you want are systemd.resource-control and systemd.exec. I haven't found any other equivalent tools that expose these cgroup features to the extent that systemd does.
I really dislike systemd, and its monolithic mass of over-engineered, all encompassing code. So I have to hang a comment here, showing just how easy this is to manage in a simple startup script. How these features are always exposed.
Taken from a SO post:
# Create a cgroup (paths below use the cgroup v1 memory controller layout)
mkdir /sys/fs/cgroup/memory/my_cgroup
# Add the process to it
echo "$PID" > /sys/fs/cgroup/memory/my_cgroup/cgroup.procs
# Set the limit to 40MB
echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/memory/my_cgroup/memory.limit_in_bytes
Linux is so beautiful. Unix is. Systemd is like a person with makeup plastered 1" thick all over their face. It detracts, obscures the natural beauty, and is just a lot of work for no reason.
This is a better explanation and fix than others I've seen. There will be differences between desktop and server uses, but misbehaving applications and libraries exist on both.
> the way stuff fails when it runs out of memory is really confusing
have you checked what your `vm.overcommit_ratio` is? If it's < 100%, then you will get OOM kills even if plenty of RAM is free since the default is 50 i.e. 50% of RAM can be COMMITTED and no more.
curious what kind of failures you are alluding to.
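For reference, the kernel documentation gives the strict-mode (`vm.overcommit_memory=2`) commit limit as `(total RAM - huge pages) * overcommit_ratio / 100 + swap`; allocations that would push committed memory past that limit are refused. A small worked example follows, with a hypothetical function name and made-up machine sizes:

```python
def commit_limit_kib(ram_kib: int, swap_kib: int,
                     overcommit_ratio: int = 50, hugetlb_kib: int = 0) -> int:
    """CommitLimit as described for vm.overcommit_memory=2, in KiB."""
    return (ram_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_kib


GIB = 1024 * 1024  # KiB per GiB

# Hypothetical machine: 16 GiB RAM, 2 GiB swap.
print(commit_limit_kib(16 * GIB, 2 * GIB) // GIB)                        # 10
print(commit_limit_kib(16 * GIB, 2 * GIB, overcommit_ratio=100) // GIB)  # 18
```

With the default ratio of 50, that machine can only commit 10 GiB in total, which is why failures can appear while tools still show plenty of free RAM. The live value is reported as `CommitLimit` in `/proc/meminfo`.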
The main scenario that caused me a lot of grief is temporary RAM usage spikes, like a single process run during a build that uses ~8gb of RAM or more for a mere few seconds and then exits. In some cases the oom killer was reaping the wrong process or the build was just failing cryptically and if I examined stuff like top I wouldn't see any issue, plenty of free RAM. The tooling for examining this historical memory usage is pretty bad, my only option was to look at the oom killer logs and hope that eventually the culprit would show up.
Thanks for the tip about vm.overcommit_ratio though, I think it's set to the default.
you can get statistics off cgroups to get idea what it was (assuming it's a service and not something user ran), but that requires probing it often enough
> At best, there's an "malloc() or die"-like stanza in the source, and that's that.
In fairness, I don't know what else general purpose software is supposed to do here other than die. It's not like there is a graceful way to handle insufficient memory to run the program.
In theory, a process could just return an error for that specific operation, which would propagate to a "500 internal error" for this one request but not impact other operations. Could even take the hint to free some caches.
But in practice, I agree with you. This is just not worth it. So much work to handle it properly everywhere, and it is really difficult to test every malloc failure.
So that's where an OOM killer might have a better strategy than just letting whichever program happens to allocate memory last be the one to fail.
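The "fail just this one request" idea can be sketched as follows. In Python a failed allocation surfaces as `MemoryError` rather than a NULL return from `malloc()`, and with overcommit enabled it may never surface at all because the kernel kills a process instead. The function name, the cache, and the handlers are hypothetical.

```python
def handle_request(handler, request, cache: dict):
    """Run one request; on allocation failure, fail only this request."""
    try:
        return 200, handler(request)
    except MemoryError:
        cache.clear()  # take the hint and release memory we can rebuild later
        return 500, "internal error: out of memory"


def ok_handler(request):
    return f"hello {request}"


def greedy_handler(request):
    raise MemoryError  # stands in for an allocation that the system refused


cache = {"expensive": "recomputable value"}
print(handle_request(ok_handler, "a", cache))      # (200, 'hello a')
print(handle_request(greedy_handler, "b", cache))  # (500, 'internal error: out of memory')
print(handle_request(ok_handler, "c", cache))      # (200, 'hello c')
```

The hard part is not this wrapper but making every library underneath it leave its state consistent when an allocation fails midway.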
Let new generations of Free Software orgs come along and supplant GNU with a GBIR (GNU But In Rust), but don't insist on existing, established things that are perfectly good for who and what they are to change into whatever you prefer at any given moment.
I wrote https://johannes.truschnigg.info/writing/2024-07-impending_g... in response to the CrowdStrike fallout, and was tempted to repost it for the recent CloudFlare whoopsie. It's just too bad that publishing rants won't change the darned status quo! :')
People will not do anything until something really disastrous happens. Even afterwards memories can fade. CrowdStrike has not lost many customers.
Covid is a good parallel. A pandemic was always possible, there is always a reasonable chance of one over the course of decades. However people did not take it seriously until it actually happened.
A lot of Asian countries are a lot better prepared for a tsunami than they were before 2004.
The UK was supposed to have emergency plans for a pandemic, but it was for a flu variant, and I suspect even those plans were under-resourced and not fit for purpose. We are supposed to have plans for a solar storm but when another Carrington event occurs I very much doubt we will deal with it smoothly.
Very cool project - hoping to see follow-up designs that can do more than 1Gbps per port!
I recently built a fully Layer2-transparent 25Gbps+ capable wireguard-based solution for LR fiber links at work based on Debian with COTS Zen4 machines and a purpose-tailored Linux kernel build - I'd be curious to know what an optimized FPGA can do compared to that.
Yes, Jumbo frames unlock a LOT of additional performance - which is exactly what we have and need on those links. Using a vanilla wg-bench[0] loopback-esque (really veths across network namespaces) setup on the machine, I get slightly more than 15Gbps sustained throughput.
Just to elaborate for others, MACSec is a standard (802.1ae) and runs at line rate. Something like a Juniper PTX10008 can run it at 400Gbps, and it’s just a feature you turn on for the port you’d be using for the link you want to protect anyway (PTXs are routers/switches, not security devices).
If I need to provide encryption on a DCI, I’m at least somewhat likely to have gear that can just do this with vendor support instead of needing to slap together some Linux based solution.
Unless, I suppose, there’s various layer 2 domains you’re stitching together with multiple L2 hops and you don’t control the ones in the middle. In which case I’d just get a different link where that isn’t true.
I have at least one switch that's MACSec compatible at line speed but I haven't had time to take a look. I guess this is confined to LAN and cannot do a MACSec link through the internet, isn't it?
Generally it's used when you have links going between two of your sites, so you typically only need it on your switch or router that terminates that link.
I realize this has not much to do with CPU choice per se, but I'm still gonna leave this recommendation here for people who like to build PCs to get stuff done with :) Since I've been able to afford it and the market has had them available, I've been buying desktop systems with proper ECC support.
I've been chasing flimsy but very annoying stability problems (some, of course, due to overclocking during my younger years, when it still had a tangible payoff) enough times on systems I had built that taking this one BIG potential cause out of the equation is worth the few dozens of extra bucks I have to spend on ECC-capable gear many times over.
Trying to validate an ECC-less platform's stability is surprisingly hard, because memtest and friends just aren't very reliably detecting more subtle problems. PRIME95, y-cruncher and linpack (in increasing order of effectiveness) are better than specialized memory testing software in my experience, but they are not perfect, either.
Most AMD CPUs (but not their APUs with potent iGPUs - there, you will have to buy the "PRO" variants) these days have full support for ECC UDIMMs. If your mainboard vendor also plays ball - annoyingly, only a minority of them enables ECC support in their firmware, so always check for that before buying! - there's not much that can prevent you from having that stability enhancement and reassuring peace of mind.
> only a minority of them enables ECC support in their firmware, so always check for that before buying!
This is the annoying part.
That AMD permits ECC is a truly fantastic situation, but whether the motherboard supports it is often doubtful, and worse: it's not advertised even when it's available.
I have an ASUS PRIME TRX40 PRO and the tech specs say that it can run ECC and non-ECC, but not whether ECC will be available to the operating system, merely that the DIMMs will work.
It's much more hit and miss in reality than it should be, though this motherboard was a pricey one: one can't use price as a proxy for features.
EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
but `dmidecode --type 16` says:
Error Correction Type: None
Error Information Handle: Not Provided
AFAIK, I have 2x DDR5 non-ECC memory (`dmidecode --type 17` says Samsung M425R1GB4BB0-CQKOL). Your command tells about SECDED (single bit error correction, double bit error detection).
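For readers unfamiliar with the term, SECDED can be illustrated with an extended Hamming code on a 4-bit value: three parity bits locate a single flipped bit, and one overall parity bit distinguishes a single flip (correctable) from a double flip (detectable only). ECC DIMMs typically apply the same principle to 64 data bits plus 8 check bits; the 8-bit toy below is only a sketch, and its function names are hypothetical.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8  # index 0: overall parity; 1, 2, 4: Hamming parity bits
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code) -> tuple:
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume one. The syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: double-bit error.
        return "uncorrectable", None
    return status, code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3


word = encode(0b1011)
word[6] ^= 1               # single bit flip
print(decode(word))        # ('corrected', 11)
word[2] ^= 1               # second flip in the same word
print(decode(word))        # ('uncorrectable', None)
```

The "corrected" case is what shows up in EDAC logs as a corrected error; the "uncorrectable" case is what usually triggers a machine check.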
Usually, if a vendor's spec sheet for a (SOHO/consumer-grade) motherboard mentions ECC-UDIMM explicitly in its memory compatibility section, and (but this is a more recent development afaict) DOES NOT specify something like "operating in non-ECC mode only" at the same time, then you will have proper ECC (and therefore EDAC and RAS) support in Linux, if the kernel version you have can already deal with ECC on your platform in general.
I would assume your particular motherboard to operate with proper SECDED+-level ECC if you have capable, compatible DIMM, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.
This is weird. I have used many ASUS MBs specified as "can run ECC and non-ECC" and this has always meant that there was an ECC enabling option in the BIOS settings, and then if the OS had an appropriate EDAC driver for the installed CPU ECC worked fine.
I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that you actually have a recent enough OS to know your Threadripper CPU and that you have installed any software package required for this (e.g. on Linux "edac-utils" or something with a similar name).
The big problem with ECC for me is that the sticks are so much more expensive. You'd expect ECC UDIMMs to have a price premium of just over 12.5% (because there are 9 chips instead of 8), but it's usually at least 100%. I don't mind paying reasonable premium for ECC, but paying double is too hard to swallow.
Trouble with enterprise is that the people buying care about the technology, but not the cost, while the people that do care about cost don’t understand the technology.
Some businesses (and governments) try and unify their purchasing, but this seems to make things worse, with the purchasing department both not understanding technology and being outwitted by vendors.
> Trouble with enterprise is that the people buying care about the technology, but not the cost
Enterprise also ruins it for small/medium businesses as well, at least those with dedicated internal IT departments who do care about both the technology and the cost. We are left with unreliable consumer-grade hardware, or prohibitively expensive enterprise hardware.
There's very little in between. This market is also underserved with software/SaaS as well with the SSO Tax and whatnot. There's a huge gap between "I'm taking the owner's CC down to best buy" and "Enterprise" that gets screwed over.
Yeah, with that kind of markup you might as well just buy new ones IF they break, or just spend the extra budget on better quality parts. Just having to pick a very specific motherboard that probably is very much not optimal for your build will blow the costs up even more, and for what gain?
I've been building my own gaming and productivity rigs for 20 years and I don't think memory has ever been a problem. Maybe survivorship bias, but surely even budget parts aren't THIS bad.
Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.
Without knowing how to fix that error you've lost 200 revisions of work. You can go back and find which revision had the problem, go before that, and upgrade it to the latest blender, but all your 200 revisions were made on other versions that you can't backport.
So don't upgrade it. Export it to an agnostic format and re-import it in the new version. Since it's failing to upgrade, it must be a metadata issue, not a data issue, so removing the Blender-specific bits will fix it.
What a silly hypothetical. There's a myriad freak occurrences that could make you have to redo work that you don't worry about. Now, I'm not saying single-bit errors don't happen. They just typically don't result in the sort of cascading failure you're describing.
Doing a lossy export/reimport process probably isn't going to be viable on something like a big movie scene blender file with lots of constraints, scripted modifiers and stuff that doesn't automatically come through with an export to USD.
My point is that there are scenarios where corruption in the past puts you in a bind and can cause a lot of loss of work or expensive diagnostic and recovery process long after it first occurred, blender was just one example but it can be much worse with proprietary software binary formats where you don't have any chance of jumping into the debugger to figure out what's going wrong with an upgrade or export. And maybe the subscription version of it won't even let you go back to the old version.
> There's a myriad freak occurrences that could make you have to redo work that you don't worry about.
Yes other sources of corruption are more likely from things like software errors. It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc., but you do have a budget and ECC is much cheaper relative to that. That doesn't mean it always makes sense for everyone to pay more for ECC. But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
>a big movie scene blender file with lots of constraints, scripted modifiers and stuff
Not really what I would call an "asset", but fine.
>It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc.
Hell, I was thinking something way simpler, like your cat climbing on the case and throwing up through the top vents, or you tripping and dropping your ass on your desk and sending everything flying.
>But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
Yeah, because those people aren't buying their own machines. If the credit card is yours and you're not doing something super critical, you're probably better served by a faster processor than by worrying against freak accidents.
>Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.
And let's say you have archived copies of it with checksums like I suggested, going back through all 200 revisions.
What's the issue again now, that ECC would have solved? Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.
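The checksum approach mentioned here might look like the sketch below: record a SHA-256 digest per archived revision, then re-verify later. Function names and file names are hypothetical. Note the scope of what this catches: a checksum detects changes that happen after the digest was taken, so a bit that flipped in RAM before the file was written and hashed would be recorded as "good".

```python
import hashlib
from pathlib import Path


def sha256_file(path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()


def build_manifest(directory) -> dict:
    """Map each file name in `directory` to its SHA-256 digest."""
    return {p.name: sha256_file(p)
            for p in sorted(Path(directory).iterdir()) if p.is_file()}


def changed_files(directory, manifest: dict) -> list:
    """Names whose current content no longer matches the recorded digest."""
    return [name for name, digest in manifest.items()
            if sha256_file(Path(directory) / name) != digest]
```

Typical use would be to call `build_manifest()` when a revision is archived, store the result alongside the archive, and run `changed_files()` before relying on an old copy.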
You would think that competition would naturally regulate the price down, but it seems like we are dealing with some sort of a cartel that regulators have not caught up with yet.
Isn't it mostly a peace of mind thing? I've never seen an ECC error on my home server which has plenty of memory in use and runs longer than my desktop. Maybe it's more common with higher clocked, near the limit, desktop PCs.
Also: DDR5 has some false ECC marketing due to the memory standard having an error correction scheme built in. Don't fall for it.
Whether you will see ECC errors depends a lot on how much memory you have and how old it is.
A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only an error after a few years. With old DIMMs, some of them will start to have frequent errors (such modules presumably had a borderline bad fabrication quality and now have become worn out, e.g. due to increased leakage leading to storing a lower amount of charge on the memory cell capacitors).
For such bad DIMMs, the frequency of errors will increase, and it may reach several errors per day, or even per hour.
For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data.
I also had a case with an HP laptop with ECC, where memory errors had become frequent after it was stored for a long time (more than a year) in a rather humid place, which might have caused some oxidation of the SODIMM socket contacts, because removing the SODIMMs, scrubbing the sockets and reinserting the SODIMMs made the errors disappear.
>A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
No. Or well, not exactly. More bits will flip randomly, but if the only difference between the two systems is the total installed memory, both will see the same number of memory errors, because bit flips in the additional 48 GB will not result in errors as long as that memory is not used. Memory errors scale with memory used, not with memory installed.
The extra unused memory might even act as shielding against cosmic rays, but the extra electrical load on the memory controller might more than balance that out for unbuffered sticks.
I see a particular ECC error at least weekly on my home desktop system, because one of my DIMMs doesn't like the (out of spec) clock rate that I make it operate at. Looks like this:
94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output)
It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement for which I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comfort zone, given that I can fully mitigate it by virtue of properly working ECC.
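If you want to check the "always the same address" claim without eyeballing the log, something like the following might do. It is only a sketch that assumes the line format is exactly what was pasted above (`key=value` fields for `addr`, `memory_channel` and `csrow`); other rasdaemon versions or memory controllers may print different fields. `error_locations` and `SAMPLE` are names invented here.

```python
import re
from collections import Counter

# Matches fields such as "addr=0x36e701dc0" or "memory_channel=1".
FIELD = re.compile(r"\b(addr|memory_channel|csrow)=(\w+)")

# The two records quoted in the thread, split across string literals.
SAMPLE = [
    "94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., "
    "CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, "
    "memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, "
    "addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, "
    "cpuid=0x00a50f00, bank=0x00000012",
    "95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., "
    "CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, "
    "memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, "
    "addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, "
    "cpuid=0x00a50f00, bank=0x00000012",
]


def error_locations(lines):
    """Count log records per (addr, memory_channel, csrow) tuple."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "addr" in fields:  # skip lines that carry no address
            key = (fields["addr"], fields.get("memory_channel"), fields.get("csrow"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for location, hits in error_locations(SAMPLE).most_common():
        print(hits, location)
```

A single dominant tuple points at one weak spot; errors spread over many addresses would suggest a different kind of problem.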
"Breaks down" is a strong choice of words for a single, corrected bit error. ECC works as designed, and demonstrates that it does by detecting this re-occurring error. I take the confidence mostly from experience ;)
And no: ECC UDIMMs rated for the speed I run mine at (3600MHz) simply do not exist - it is outside of what JEDEC ratified for the DDR4 spec.
I would loosen the memory timings a bit and see if that resolves the ECC errors. x265 performance shouldn't fall since it generally benefits more from memory clock rate than latency.
Also, could you share some relevant info about your processor, mainboard, and UEFI? I see many internet commenters question whether their ECC is working (or ask if a particular setup would work), and far fewer that report a successful ECC consumer desktop build. So it would be nice to know some specific product combinations that really work.
There are probably many others with proper ECC support. Vendor spec sheets usually hint at properly working ECC in their firmware if they mention "ECC UDIMM" support specifically.
As for CPUs, that is even easier for AM4: Everything that's not based on an APU core (there are some SKUs marketed without iGPU that just have the iGPU part of the APU disabled, such as the Ryzen 5 5500) can support ECC. An exception to that rule are "PRO"-series APUs, such as the Ryzen 5 PRO 5650G et al., which have an iGPU, but also support ECC. The main differences (apart from the integrated graphics) between CPU and APU SKUs are that the latter do not support PCIe 4.0 (APUs are limited to PCIe 3.0), and have a few Watts lower idle power consumption.
If I were to build an AM5 system today, I would look into mainboards from ASUS for proper ECC support - they seem to have it pretty much universally supported on their gear. (Actual out-of-band ECC with EDAC support on Linux, not the DDR5 "on-die" stuff.)
I think you've found a particularly weak memory cell; I would start thinking about replacing that module. The consistent memory_channel=1, csrow=0 pattern confirms it's the same physical location failing predictably.
I had a somewhat dodgy stick of used RAM (DDR4 UDIMM) in a Supermicro X11 board. This board is running my NAS, all ZFS, so RAM corruption can equal data corruption. The OS alerted me to recoverable errors on DIMM B2. Swapped it and another DIMM, rebooted, saw DIMM error on slot B1. Swapped it for a spare stick. No more errors.
This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it.
I saw a corrected memory error logged every few hours when my current machine was new. It seems to have gone away now, so either some burn-in effect, or ECC accidentally got switched off and all my data is now corrupted. Threadripper 7000 series, 4x64GB DDR5.
Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause.
I have a slightly older system with 128 GB of UDIMM DDR4 over four sticks. Ran just fine for quite a while but then I started having mysterious system freezes. Later discovered I had somehow disabled ECC error reporting in my system log on Linux... once that was turned back on, oh, I see notices of recoverable errors. I finally found a repeatable way to trigger a freeze with a memory stress testing tool and that was from an unrecoverable error. I couldn't narrow the problem down to a single stick or RAM channel, it seemed to only happen if all 4 slots were occupied, but I eventually figured out that if I just lowered the RAM speed from standard 3200 MHz to the next officially supported (by the sticks) step of 2933 MHz, everything was fine again and no more ECC errors, recoverable or not. Been running like that since.
Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ECC marketing for DDR5 that you mention. The motherboard situation for whether they support it or not, or whether a BIOS update added support or then removed it or added it back or not, was also really sad. And even worse IMO is that you can't actually max out 4 slots on the top tier mobos unless you're willing to accept a huge drop in RAM speed. Leads to ugly 48 GB sized sticks and limiting to two of them... In the end we didn't go with ECC for that someone, but I was pretty disappointed about it. I'm hoping the next gen will be better, for my own setup running ZFS and such I'm not going to give up ECC.
You have to go pretty far down the rabbit hole to make sure you’ve actually got ECC with [LP]DDR5
Some vendors use Hamming codes with “holes” in them, and you need the CPU to also run ECC (or at least error detection) between RAM and the cache hierarchy.
Those things are optional in the spec, because we can’t have nice things.
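For anyone who hasn't seen a Hamming code up close, here is a toy Hamming(7,4) encoder/decoder. To be clear about assumptions: this is the textbook code, not what any particular DRAM vendor or memory controller implements, and it does not model the "holes" mentioned above; real ECC memory uses much wider codewords. `encode` and `decode` are illustrative names.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of a single flipped bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the suspected bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit flip at position 5
    print(decode(word))
```

A single flipped bit is located and repaired. With two flipped bits this plain code "corrects" a third, innocent bit and returns wrong data without complaint, which is why memory ECC adds at least an extra overall parity bit to detect double errors.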
I pick up old servers for my garage system. With EDAC it is a dream to isolate the fault and be instantly aware. It also lets you determine the severity of the issue. DIMMs can run for years with just the one error or overnight explode into streams of corrections. I keep spares so it’s fairly easy to isolate any faults. It’s just how do you want to spend your time?
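For the "instantly aware" part, the Linux EDAC subsystem exposes per-memory-controller counters in sysfs, so a small poller is enough. This is a minimal sketch assuming the usual layout under `/sys/devices/system/edac/mc` with `ce_count` (corrected) and `ue_count` (uncorrected) files; `read_edac_counts` is a name invented here.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict[str, dict[str, int]]:
    """Return {"mc0": {"ce": n, "ue": m}, ...} for each memory controller."""
    counts = {}
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        try:
            counts[mc_dir.name] = {
                "ce": int((mc_dir / "ce_count").read_text().strip()),
                "ue": int((mc_dir / "ue_count").read_text().strip()),
            }
        except (OSError, ValueError):
            # Counter file missing or unreadable: skip this controller.
            continue
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found")
    for name, c in found.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

An empty result usually means no EDAC driver is loaded for the memory controller, which is worth ruling out before concluding that a machine simply has no errors.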
Excellent point. It's a shame and a travesty that data integrity is still mostly locked away inside servers, leaving most other computing devices effectively toys: early prototype demos that were never finished, yet are sold forever at inflated prices.
I wish AMD would make ECC a properly advertised feature with clear motherboard support. At least DDR5 has some level of ECC.
I wish AMD wouldn't gate APU ECC support behind unobtainium "PRO" SKUs they only give out, seemingly, to your typical "business" OEMs and the rare Chinese miniPC company.
So I'm trying to learn more about this stuff, but aren't there multiple ECC flavors and the AMD consumer CPUs only support one of them (not the one you'd have on servers?)
Does anyone maintain a list with de-facto support of AMD chips and mainboards? That partlist site only shows official support IIRC, so it won't give you any results.
The difference between the "unbuffered" ECC DIMMs (ECC UDIMMs), which you must use in desktop motherboards (and in some of those advertised as "workstation" MBs) and the "registered" ECC DIMMs (ECC RDIMMs), which you must use in server motherboards (and in some of the "workstation" MBs), has existed for decades.
However, in the past there existed a very few CPU models and MBs that accepted either kind of DIMM, while today this has become completely impossible, as the mechanical and electrical differences between them have increased.
In any case, today, just like 20 years ago, when searching for ECC DIMMs you must always search only for the correct type, e.g. unbuffered ECC DIMMs for desktop CPUs.
In general, registered ECC DIMMs are easier to find, because wherever "server memory" is advertised, that is what is meant. For desktop ECC memory, you must be careful to see both "ECC" and "unbuffered" mentioned in the module description.
Had you been looking for "in-band ECC", the cheap ODROID H4 PLUS ($150) or the cheaper ODROID H4 ($110) would have been fine, or for something more expensive some of the variants of Asus NUC 13 Rugged support in-band ECC.
For out-of-band ECC, e.g. with standard ECC SODIMMs, all the embedded SBCs that I have seen used only CPUs that are very obsolete nowadays, i.e. ancient versions of Intel Xeon or old AMD industrial Ryzen CPUs (AMD's series of industrial Ryzen CPUs are typically at least one or two generations behind their laptop/desktop CPUs).
Moreover all such industrial SBCs with ECC SODIMMs were rather large, i.e. either in the 3.5" form factor or in the NanoITX form factor (120 mm x 120 mm), and it might have been necessary to replace their original coolers with bigger heatsinks for fanless operation.
In-band ECC causes a significant decrease in performance, but for most applications of such mini-PCs the performance is completely acceptable.
Now where can I get 64GB ECC UDIMM DDR5 modules so that my X870E board can have 256GB RAM? The largest I found were just 48GB ECC UDIMMs or 64GB non-ECC UDIMMs.
In my experience, it's generally unwise to push the platform you're on to the outermost of its spec'd limits. At work, we bought several 5950X-based Zen3 workstations with 128GB of 3200MT/s ECC UDIMM, and two of these boxes will only ever POST when you manually downclock memory to 3000MT/s. Past a certain point, it's silicon lottery deciding if you can make reality live up to the datasheets' promises.
I am fine with downclocking the RAM; my X870E board (ProArt) should be fine running ECC, I only use 9800X3D to have a single CCD (maybe upgraded later to EPYC 4585PX) and together have RTX 6000 Pro and 2x NVLinked A6000 in PCIe slots, with two M.2 SSDs. Power supply follows the latest specs as well. This build was meant to be a light-weight Threadripper replacement and ECC is a must for my use cases (it's a build for my summer house so that I can do serious work while there).
Any specific recommendations? I am having random, OS-agnostic lockups on my Ryzen 1xxx build and thought DDR5 would be enough, but true ECC sounds good.
edit: Looks like a lot of Asus motherboards work, and the thing to look for is "unbuffered" ECC. Kingston has some, I see 32GB module for $190 on Newegg.
Do you live at a very high altitude with a significant amount of solar radiation, or at an underfunded radiology lab or perhaps near a uranium deposit or a melted down nuclear reactor? Because the average machine should never see a memory bit flip error at all during its entire lifetime.
> Furthermore, research shows that precisely targeted three-bit Rowhammer flips prevents ECC memory from noticing the modifications.
Doesn't exactly sound like a use case for ECC memory, given that it can't correct these attacks. Interesting though, I'd have thought that virtual addresses would've largely fixed this.