I agree that the broader software ecosystem has been slow to recognize the importance of Operational Safety. The CrowdStrike outage, while unfortunate, has indeed served as a wake-up call, elevating Operational Safety to a priority for software leaders and CIOs alike.
As you pointed out, the reliance on complex, mission-critical systems is only increasing, and cascading failures are an inherent risk we must address proactively. By learning from organizations like AWS that have successfully integrated Operational Safety into their practices, we can work towards a more resilient and reliable software ecosystem. Let's continue to advocate for making Operational Safety a foundational element in software operations across the industry.
How are users handling this change? With Buoyant cutting down on their investment, does Linkerd OSS have enough contributors to sustain it?
"As of Linkerd 2.15.0, the open source project no longer publishes stable releases. Instead, the vendor community around Linkerd is responsible for supported, stable releases."
Couldn't agree more. As someone who used to work at AWS, I've seen it from both sides. AWS has valid reasons (business and technical) for not taking responsibility for all the layers on top. The missing piece is the operational knowledge AWS possesses but platform teams elsewhere lack access to. That's one reason to bring in a “trusted broker” to bridge this gap.
k8s complexity is a challenge at scale, but its continued growth seems likely. Reasons include strong community support, continuous innovation that regularly adds new capabilities, and standardization across the various layers of the substrate, to name a few. It's not for everyone, though. Teams with simpler needs might find k8s overkill and opt out for valid reasons. Overall, the benefits and community support make it a go-to for many, despite the challenges.
Two factors imo: scope of permissions, and time spent on manual configuration/approvals as a company grows in headcount and in the number of resources it manages. ClickOps doesn't scale well, since more employees requesting access to more resources means more context switches to grant/revoke those requests. At my last large fintech company, this led to lots of ad-hoc access requests and time spent tracking down which file contained the permissions and reviewer for which resource - time that could've been spent developing features instead.
As for when - if the devs in a company find themselves spending much of their time manually approving permission requests, or just granting broad access without paying attention to scope, that's a good sign they're ready to move to IaC.
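To make the "permissions as code" idea concrete, here's a rough sketch in plain Python. It isn't tied to any real IaC tool, and every name in it (`Grant`, `GRANTS`, the teams, resources, and reviewers) is made up for illustration. The point is only that once grants live in one reviewable place, both "who approves changes to this resource?" and "which grants are too broad?" become simple lookups instead of a hunt through files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One access grant, declared as data (hypothetical schema)."""
    principal: str   # who gets access
    resource: str    # what they get access to
    actions: tuple   # allowed actions, e.g. ("read", "write")
    reviewer: str    # who approves changes to this grant


# Hypothetical example data; in practice this would live in version control.
GRANTS = [
    Grant("team-payments", "db/ledger", ("read", "write"), "alice"),
    Grant("team-analytics", "db/ledger", ("read",), "alice"),
    Grant("team-analytics", "bucket/reports", ("*",), "bob"),
]


def overly_broad(grants):
    """Return grants that use a wildcard for actions or for the resource."""
    return [g for g in grants if "*" in g.actions or g.resource == "*"]


def reviewers_for(grants, resource):
    """Return the sorted, de-duplicated reviewers for a given resource."""
    return sorted({g.reviewer for g in grants if g.resource == resource})


if __name__ == "__main__":
    for g in overly_broad(GRANTS):
        print(f"too broad: {g.principal} -> {g.resource} {g.actions}")
    print("db/ledger reviewers:", reviewers_for(GRANTS, "db/ledger"))
```

A check like `overly_broad` could run in CI on every change, so scope gets looked at as part of normal review rather than depending on whoever happens to be clicking "approve".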