This is a great idea, but it's a great idea when it's on-prem.
During some thread, somewhere, there's going to be a roundtrip time between my servers and yours, and once I am at a scale where this sort of thing matters, I'm going to want this on-prem.
What's the difference between this and checking against a local cache before firing the request and marking the service down in said local cache so my other systems can see it?
I'm also concerned about a false positive or a single system throwing an error. If it's a false positive, then the protected asset fails on all of my systems, which doesn't seem great. I'll take some requests working vs none when money is in play.
You also state that "The SDK keeps a local cache of breaker state" -- If I've got 50 servers, where is that local cache living? If it's per process, that's not great, and if it's in a local cache like redis or memcache, I'm better off using my own network for "sub microsecond response" vs the time to go over the wire to talk to your service.
I've fought huge cascading issues in production at very large social media companies. It takes a bit more than breakers to solve these problems. Backpressure is a critical component of this, and often turning things off completely isn't the best approach.
On-prem: You're right, and it's on the roadmap. For teams at the scale you're describing, a hosted control plane doesn't make sense. The architecture is designed to be deployable as a self-hosted service, the SDK doesn't care where the control plane lives, just that it can reach it (you can swap the OpenfuseCloud class with just the Openfuse one, using your own URL).
Roundtrip time: The SDK never sits in the hot path of your actual request. It doesn't check our service before firing each call. It keeps a local cache of the current breaker state and evaluates locally, the decision to allow or block a request is pure local memory, not a network hop. The control plane pushes state updates asynchronously. So your request latency isn't affected. The propagation delay is how quickly a state change reaches all instances, not how long each request waits.
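To make the "pure local memory" claim concrete, here is a minimal sketch of the idea (the names and shapes are illustrative, not the real SDK API): breaker state lives in process memory, the control plane pushes updates asynchronously, and the per-request check is just a map lookup.

```javascript
// Illustrative sketch, NOT the actual Openfuse SDK: breaker state is held
// in process memory and the per-request check never touches the network.

const breakerState = new Map(); // name -> { status: 'closed' | 'open' | 'half-open' }

// Called by the control-plane push handler, asynchronously, off the request path.
function applyStateUpdate(name, status) {
  breakerState.set(name, { status, updatedAt: Date.now() });
}

// Called on every request: a pure in-memory read, no network hop.
function allowRequest(name) {
  const state = breakerState.get(name);
  return !state || state.status !== 'open'; // unknown breakers default to closed
}

applyStateUpdate('payments-api', 'open');
console.log(allowRequest('payments-api')); // false
console.log(allowRequest('search-api'));   // true (no state yet, defaults to closed)
```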
False positives / single system errors: This is exactly why aggregation matters. Openfuse doesn't trip because one instance saw one error. It aggregates failure metrics across the fleet, you set thresholds on the collective signal (e.g., 40% failure rate across all instances in a 30s window). A single server throwing an error doesn't move that needle. The thresholds and evaluation windows are configurable precisely for this reason.
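A rough sketch of what fleet-level aggregation buys you (illustrative, not the actual Openfuse evaluator): per-instance counters are summed before the threshold is applied, so one noisy instance can't trip the breaker on its own.

```javascript
// Illustrative sketch of fleet-wide evaluation: the threshold applies to the
// collective failure rate, not to any single instance's errors.

function shouldTrip(instanceStats, { failureRateThreshold, minRequests }) {
  // instanceStats: [{ requests, failures }] reported by each instance in the window
  const totals = instanceStats.reduce(
    (acc, s) => ({ requests: acc.requests + s.requests, failures: acc.failures + s.failures }),
    { requests: 0, failures: 0 }
  );
  if (totals.requests < minRequests) return false; // not enough signal to decide
  return totals.failures / totals.requests >= failureRateThreshold;
}

// One instance failing hard doesn't move the collective needle:
const stats = [
  { requests: 100, failures: 100 }, // a single bad instance
  { requests: 100, failures: 0 },
  { requests: 100, failures: 0 },
  { requests: 100, failures: 2 },
];
console.log(shouldTrip(stats, { failureRateThreshold: 0.4, minRequests: 50 })); // false (~25% fleet-wide)
```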
Local cache location: It's in-process memory, not Redis or Memcache. Each SDK instance holds the last known breaker state in memory. The control plane pushes updates to connected SDKs. So the per-request check is: read a boolean from local memory. The network only comes into play when state changes propagate, not on every call.
The cache size for 100 breakers is ~57KB; for 1,000 (which is quite extreme) it's ~393KB.
Backpressure: 100% agree, breakers alone don't solve cascading failures. They're one layer. Openfuse is specifically tackling the coordination and visibility gap in that layer, not claiming to replace load shedding, rate limiting, retry budgets, or backpressure strategies. Those are complementary. The question I'm trying to answer is narrower: when you do have breakers, why is every instance making that decision independently? Why do you have no control over what's going on? Why do you need to make a code change to temporarily disconnect your server from a dependency? And if you have 20 services, why do you configure it 20 times (once per repo)?
Would love to hear more about what you've seen work at scale for the backpressure side. That would be a next step :)
Caveat: I was employee 13 at Twitter and I spent a long time dealing with random failure modes.
At extremely high scale you start to run into very strange problems. We used to say that all of your "Unix Friends" fail at scale and act differently.
I once had 3000 machines running NTP sync'd cronjobs on the exact same second pounding the upstream server and causing outages (Whoops, add random offsets to cron!)
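The fix being described (random offsets) can be sketched in Node terms; the original fix lived in cron (e.g. sleeping a random number of seconds before the command runs), but the principle is the same: spread synchronized jobs across a window so the whole fleet doesn't fire at once.

```javascript
// Sketch of the "add random offsets" fix: a job nominally scheduled for :00
// actually starts somewhere in the following window, so 3000 machines don't
// all hit the upstream on the exact same second.

function jitteredDelayMs(maxJitterSeconds) {
  return Math.floor(Math.random() * maxJitterSeconds * 1000);
}

// e.g. spread an on-the-hour job across the next 5 minutes
const delayMs = jitteredDelayMs(300);
console.log(delayMs >= 0 && delayMs < 300 * 1000); // true
```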
This sort of "dogpile effect" exists when fetching keys as well. A key drops out of cache and 30 machines (or worker threads) try to load the same key at the same time, because the cache is empty.
One of the solutions around this problem was Facebook's DataLoader (https://github.com/graphql/dataloader), which intercepts the request pipeline, batches requests together, and coalesces many requests into one.
Essentially DataLoader will coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call your batch function with all requested keys.
It helps by reducing requests and offering something resembling backpressure by moving the request into one code path.
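The single-tick coalescing described above can be sketched in a few lines. This is a minimal illustration of the pattern DataLoader implements, not the real library: loads requested within the same tick of the event loop are collected and handed to one batch function.

```javascript
// Minimal sketch of DataLoader-style coalescing (not the real library):
// all loads issued within a single tick are batched into one call.

function makeLoader(batchFn) {
  let queue = []; // { key, resolve } pairs collected during the current tick
  return function load(key) {
    return new Promise((resolve) => {
      queue.push({ key, resolve });
      if (queue.length === 1) {
        // First load this tick: schedule one batch flush at the end of the tick.
        process.nextTick(async () => {
          const batch = queue;
          queue = [];
          const results = await batchFn(batch.map((e) => e.key));
          batch.forEach((e, i) => e.resolve(results[i]));
        });
      }
    });
  };
}

// Three loads in the same tick -> a single batch call with all three keys.
const load = makeLoader(async (keys) => {
  console.log('batch:', keys); // batch: [ 'a', 'b', 'c' ]
  return keys.map((k) => k.toUpperCase());
});
Promise.all([load('a'), load('b'), load('c')]).then((r) => console.log(r)); // [ 'A', 'B', 'C' ]
```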
I would expect that you'd have the same sort of problem at scale with this system given the number of requests on many procs across many machines.
We had a lot of small tricks like this (they add up!), in some cases we'd insert a message queue in between the requestor and the service so that we could increase latency / reduce request rate while systems were degraded. Those "knobs" were generally implemented by "Decider" code which read keys from memcache to figure out what to do.
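The Decider pattern described above is simple to sketch (the store and key names here are illustrative): runtime behavior is driven by values read from a shared store, with a hard rule that the knob store being down must never take you down too.

```javascript
// Illustrative sketch of a "Decider": read a behavior knob from a shared
// store (memcache in the original setup) and fall back safely on any failure.

async function decide(store, key, fallback) {
  try {
    const value = await store.get(key);
    return value !== undefined && value !== null ? value : fallback;
  } catch {
    return fallback; // the decider store being down must never take you down
  }
}

// Usage: degrade gracefully by routing writes through a queue when the knob says so.
const fakeStore = { get: async (k) => (k === 'use_queue_for_writes' ? true : undefined) };
decide(fakeStore, 'use_queue_for_writes', false).then((useQueue) => {
  console.log(useQueue ? 'enqueue write' : 'write directly'); // enqueue write
});
```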
By "pushes to connected SDKs": I assume you're holding a thread with this connection. How do you reconcile this when you're running something like Node with PM2, where you've got 30-60 processes on a single host? They won't be sharing memory, so that's a lot of updates.
It seems better to have these updates pushed to one local process that other processes can read from via socket or shared memory.
I'd also consider the many failure modes of services. Sometimes services go catatonic upon connect and don't respond, sometimes they time out, sometimes they throw exceptions, etc...
There's a lot to think about here but as I said what you've got is a great start.
This is incredibly generous context... thank you. A few of these hit close to problems I'm thinking about.
The Decider pattern you're describing (reading keys from memcache to decide behavior at runtime) is essentially what Openfuse is trying to productize. A centralized place that tells your fleet how to behave, without each process figuring it out independently. So it's validating to hear that's where Twitter landed organically.
On the PM2 point: you're right, holding a connection per process doesn't scale well at that density. A local sidecar that receives state updates and exposes them via socket or shared memory to sibling processes is a much better model. That's not how it works today (each process holds its own connection), but your framing is exactly how I'd want to evolve it. That said, I can't promise it's in the short-term goals: I need to validate the product first, add some important features, and publish the self-hosted version.
On the dogpile: the half-open state is where this matters most. When a breaker opens and then transitions to half-open, you don't want 50 instances all sending probe requests simultaneously. The coalescing pattern you're describing from DataLoader is a neat way of solving it; I wonder if I can implement this somehow without adding a service/proxy closer to the clients just for that.
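One client-side way to damp the half-open stampede, as a sketch (this is illustrative, not current Openfuse behavior): each instance independently probes with probability `targetProbes / fleetSize`, so only a handful of trial requests go out per round instead of the whole fleet.

```javascript
// Illustrative sketch: probabilistic half-open probing. Each instance decides
// independently, and the expected number of probes per round is ~targetProbes.

function shouldProbe(fleetSize, targetProbes = 3) {
  return Math.random() < targetProbes / fleetSize;
}

// Across a 50-instance fleet, expect ~3 probes per round instead of 50 at once.
const probes = Array.from({ length: 50 }, () => shouldProbe(50)).filter(Boolean).length;
console.log(probes); // usually a small number, ~3 on average
```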
On failure modes: agreed, "service is down" is the simplest case. Catatonic connections, slow degradation, partial responses that look valid but aren't: those are harder to classify. Right now Openfuse trips on error rates, timeouts, and latency. The back-end is ready for custom metrics, though; I just haven't implemented them yet. Having the breaker trip based on OpenTelemetry metrics is also something I'm looking forward to trying, which opens up a whole new world.
I'm not going to pretend this is built for Twitter-scale problems today. But hearing that the patterns you arrived at are directionally where this is headed is really encouraging.
Didn’t we solve this already with slab allocators in memcached? The major problem with fixed allocation like this is fragmentation in memory over time, which you then have to reinvent GC for.
The trivial defense against this is time limited passwords for Wifi access. Deny all access until a valid password is entered, only permit that password and MAC address pair for n minutes.
On a technical level it’s trivial, but you’re talking about having a shop replace their wifi router and/or update firmware, create some way for staff to see the current password and/or integrate with POS systems to print it on the receipt, update signage, etc. Hardly trivial for the average non-techie business owner.
Great idea, but yet another blog post which is actually marketing, which ends with “they did it, buy our product so you can too”, which is probably not what Meta did.
Sure, it doesn't directly affect any practical programming tasks to matter in real life... but sometimes it's good to indulge in intellectual curiosity. It sharpens our understanding of how far a language can go, no matter how futile that understanding is. This is HN after all and "anything that gratifies one's intellectual curiosity" is on-topic on HN.
It matters in the way all academic exploration matters.
Which is to say it doesn't matter if it matters.
Or rather, you can say it matters or not depending on your own personality and capacity for intellectual curiosity.
If all you care about is can I eat it or fuck it, then it doesn't matter. The limits of c match the limits of the hardware, and so you can never run into them. You can express anything the hardware could do. Except, the only way I can say "it doesn't matter within this scope, because of this scope" is because someone at least thought about it long enough to figure that out. The question mattered even though the answer came back "don't worry about it", because that answer matters.
It matters, or rather it may matter (you or someone has to think about it for a while to find out), if you would be one of the people who helps understand the current world and helps design the next world rather than just use the world however it happens to exist today.
It may also be that it would matter, but the premise is wrong in some way, which again you can't know without formulating the problem and then thinking about it.
I would say that c with variable pointer sizes or compoundable pointers would still be c, and so if the spec happens to specify the size of pointers, that's just a silly implementation detail that shouldn't be hard-coded in the spec and could be ignored or officially revised without changing anything that matters. Like saying the language isn't Turing complete because the original proposal has a typo.
Another tack: someone else pointed out that __VA_ARGS__ gets around it. But that's not magic. __VA_ARGS__ has to be implemented somehow using the tools of the rest of the spec. If one part of the spec says something that the rest of the spec can't do, then the spec is inconsistent. That doesn't mean the language isn't Turing complete. It means it might or might not be, depending on how you choose to resolve the inconsistency.
Not a CS guy, but AFAIU guaranteed termination lowers the automata class to push-down automata or FSMs, which are definitely not what we’d agree to call general-purpose programming languages. In practical terms, they have a computing regime equivalent to that of regexps of various sorts.
probably the hardest thing here is people read faster than they write.
in the 1990s I helped build a dating site where people put their profiles on small voicemails. This is very reminiscent of that, and I think engagement was low because it was a lot to listen to.