Hacker News | kaicianflone's comments

Great read. The bilingual shadow reasoning example is especially concerning. Subtle policy shifts reshaping downstream decisions is exactly the kind of failure mode that won’t show up in a benchmark leaderboard.

My wife is trilingual, so now I’m tempted to use her as a manual red team for my own guardrail prompts.

I’m working on LLM guardrails as well, and what worries me is orchestration becoming its own failure layer. We keep assuming a single model or policy can “catch” errors. But even a 1% miss rate, when composed across multi-agent systems, cascades quickly in high-stakes domains.

I suspect we’ll see more K-LLM architectures where models are deliberately specialized, cross-checked, and policy-scored rather than assuming one frontier model can do everything. Guardrails probably need to move from static policy filters to composable decision layers with observability across languages and roles.

Appreciate you publishing the methodology and tooling openly. That’s the kind of work this space needs.


The cascading failure point is critical. A 1% miss rate per layer in a 5-layer pipeline gives you roughly 5% end-to-end failure, and that's assuming independence. In practice the failures correlate because multilingual edge cases that bypass one guardrail tend to bypass adjacent ones too.
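
Back-of-the-envelope, treating the layers as independent (which, again, they aren't, so treat this as a floor):

    # Chance that at least one of n layers misses, if each layer
    # independently misses with probability p. Correlated failures
    # only make this number worse.
    def pipeline_miss(p: float, n: int) -> float:
        return 1 - (1 - p) ** n

    print(round(pipeline_miss(0.01, 5), 3))  # 0.049, i.e. roughly 5%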

The observation that guardrails need to move from static policy filters to composable decision layers is exactly right. But I'd push further: the layer that matters most isn't the one checking outputs. It's the one checking authority before the action happens.

A policy filter that misses a Persian prompt injection still blocks the action if the agent doesn't hold a valid authorization token for that scope. The authorization check doesn't need to understand the content at all. It just needs to verify: does this agent have a cryptographically valid, non-exhausted capability token for this specific action?

That separates the content safety problem (hard, language-dependent, probabilistic) from the authority control problem (solvable with crypto, language-independent, deterministic). You still need both, but the structural layer catches what the probabilistic layer misses.
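
To be concrete, the authority check is roughly this shape (a sketch only; the token layout, field names, and HMAC scheme here are made up, not any particular library):

    # Sketch of an authority check that never inspects content.
    import hmac, hashlib, json, time

    SIGNING_KEY = b"issuer-signing-key"  # held by the token issuer, not the agent

    def authorized(token: dict, requested_action: str) -> bool:
        payload = json.dumps(token["claims"], sort_keys=True).encode()
        expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, token["sig"]):
            return False  # signature invalid: never issued, or tampered with
        claims = token["claims"]
        if claims["scope"] != requested_action:
            return False  # token was never granted for this action
        if claims["uses_remaining"] <= 0:
            return False  # capability exhausted
        if claims["expires_at"] < time.time():
            return False  # expired
        return True

A prompt injection in any language can fool the policy filter, but it can't mint a valid signature for a scope the agent was never granted.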


For some reason, before reading, I thought this was going to be an AI thought-leadership piece, but it's even better than I expected.

I’m not sure if I prefer coding in 2025 or 2026 now.

This is why I’m using the open source consensus-tools engine and CLI under the hood. I run ~100 maintainer-style agents against changes, but inference is gated at the final decision layer.

Agents compete and review, then the best proposal gets promoted to me as a PR. I stay in control and sync back to the fork.

It’s not auto-merge. It’s structured pressure before human merge.


I’m working on an open source project that treats this as a consensus problem instead of a single model accuracy problem.

You define a policy (majority, weighted vote, quorum), set the confidence level you want, and run enough independent inferences to reach it. Cost is visible because reliability just becomes a function of compute.
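
To make the “reliability as a function of compute” part concrete, here is the arithmetic for a plain majority policy (illustrative only, not the actual consensus-tools interface, and it assumes the inferences are independent):

    # Probability that a majority of n independent inferences is correct,
    # given per-inference accuracy `acc`, and the smallest odd n that
    # reaches a target confidence.
    from math import comb

    def majority_correct(acc: float, n: int) -> float:
        return sum(comb(n, i) * acc**i * (1 - acc)**(n - i)
                   for i in range(n // 2 + 1, n + 1))

    def runs_needed(acc: float, target: float, max_n: int = 101) -> int:
        for n in range(1, max_n + 1, 2):
            if majority_correct(acc, n) >= target:
                return n
        return max_n

    print(runs_needed(0.90, 0.999))  # 9 runs to go from 90% to 99.9% under majority vote

That’s the sense in which cost becomes a dial: you buy confidence with more independent runs.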

The question shifts from “is this output correct?” to “how much certainty do we need, and what are we willing to pay for it?”

Still early, but the goal is to make accuracy and cost explicit and tunable.


This matches how I’ve been thinking about it.

With consensus.tools we split things intentionally. The OSS CLI solves the single user case. You can run local "consensus boards" and experiment with policies and agent coordination without asking anyone for permission.

Anything involving teams, staking, hosted infra, or governance sits outside that core.

Open source for us is the entry point and trust layer, not the whole business. Still early, but the federation vs stadium framing is useful.


This is interesting. I’m experimenting with something adjacent in an open source plugin, but focused less on orchestration and more on decision quality.

Instead of just wiring agents together, I require stake and structured review around outputs. The idea is simple: coordination without cost trends toward noise.

Curious how entire.io thinks about incentives and failure modes as systems scale.


I’m working on an open source CLI that experiments with governance at inference time for autonomous systems.

The idea is to let multiple agents propose, critique, and stake on decisions before a single action is taken, rather than letting one model silently decide. It’s model-agnostic and runs locally, with no blockchain or financial layer involved.

I’m mostly exploring whether adding explicit disagreement and cost at decision time actually improves outcomes in high-stakes or automated workflows.

https://github.com/consensus-tools/consensus-tools

I've also created an AgentSkill to interact with the CLI:

https://github.com/kaicianflone/consensus-interact


I think the core insight here is about incentives and friction, not crypto specifically.

I’m working on an open source CLI that experiments with this at a local, off-chain level. It lets maintainers introduce cost, review pressure, or reputation at submission time without tying anything to money or blockchains. The goal is to reduce low-quality contributions without financializing the workflow or creating new attack surfaces.


I’m a systems person too, and I don’t see mediocrity as inevitable.

The slop problem isn’t just model quality. It’s incentives and decision-making at inference time. That’s why I’m working on an open source tool for governance and validation during inference, rather than trying to solve everything in pre-training.

Better systems can produce better outcomes, even with the same models.


I agree completely, and I’m doing the same thing. Good tools that help produce better outcomes will have a multiplicative impact as models improve.

What are you building?


The open source system I’m working on lets multiple agents propose, critique, and stake on decisions before a single action is taken.

It runs at inference time rather than training time and is model-agnostic. The goal is to make disagreement explicit and costly instead of implicit and ignored, especially in high-stakes or autonomous workflows.
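
Very roughly, the aggregation step looks like this (a sketch; the proposal shape, stake units, and threshold are illustrative, not the project's actual interface):

    # Stake-weighted agreement gate: the action only goes through when one
    # proposal holds a large enough share of total stake; otherwise the
    # disagreement is surfaced instead of silently resolved.
    from collections import defaultdict
    from typing import NamedTuple, Optional

    class Proposal(NamedTuple):
        agent: str
        action: str
        stake: float

    def decide(proposals: list, min_share: float = 0.66) -> Optional[str]:
        totals = defaultdict(float)
        for p in proposals:
            totals[p.action] += p.stake
        action, stake = max(totals.items(), key=lambda kv: kv[1])
        if stake / sum(totals.values()) < min_share:
            return None  # too much disagreement: escalate to a human, take no action
        return action

    print(decide([Proposal("a1", "merge", 3.0),
                  Proposal("a2", "merge", 2.0),
                  Proposal("a3", "reject", 1.0)]))  # merge (5/6 of the stake agrees)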

