One thing Cloudflare Workers gets right is strong execution isolation.
When self-hosting, what’s the failure model if user code misbehaves?
Is there any runtime-level guardrail or tracing for side-effects?
Asking because execution is usually where things go sideways.
Workers that hit limits (CPU, memory, wall-clock) get terminated cleanly with a clear reason. Exceptions are caught with stack traces (at least they should be, lol), and logs stream in real time.
What's next: execution recording. Every invocation captures a trace: request, binding calls, timing. Replay locally or hand it to an AI debugger. No more "works on my machine".
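A minimal sketch of what per-invocation trace capture could look like. All names here are hypothetical illustrations, not the runtime's actual API; it just shows the shape of "request + binding calls + timing" in one replayable record:

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class InvocationTrace:
    """Hypothetical per-invocation trace: request, binding calls, timing."""
    request: dict
    events: list = field(default_factory=list)
    started: float = field(default_factory=time.monotonic)

    def record(self, kind: str, detail: dict) -> None:
        # Each binding call gets a relative timestamp so the trace
        # can be replayed in order later.
        self.events.append({
            "t": round(time.monotonic() - self.started, 6),
            "kind": kind,
            "detail": detail,
        })

    def dump(self) -> str:
        # One JSON document per invocation: easy to ship to a
        # local replayer or an AI debugger.
        return json.dumps({"request": self.request, "events": self.events})

trace = InvocationTrace(request={"method": "GET", "path": "/users/42"})
trace.record("kv.get", {"key": "user:42"})
trace.record("fetch", {"url": "https://api.example.com/orders"})
print(trace.dump())
```

The key design choice is that the trace is a plain serializable value, so "replay locally" is just feeding the same record back through the runtime.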
This makes a lot of sense. Recording execution + replay is exactly what’s missing once you move past simple logging.
One thing I’ve found tricky in similar setups is making sure the trace is captured before side-effects happen, otherwise replay can lie to you. If you get that boundary right, the prod → replay → fix → verify loop becomes much more reliable.
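The "capture before side-effects" ordering can be sketched as a wrapper that appends the intended call to the trace *before* executing it; if logging happened after execution, a crash mid-effect would leave the effect applied but missing from the trace, and replay would lie. Names here are illustrative only:

```python
import functools

TRACE: list[dict] = []

def traced(fn):
    """Record intent BEFORE running the side-effect.

    Order matters: (1) log the intended call, (2) perform the effect,
    (3) mark completion. A crash between (2) and (3) still leaves an
    honest 'attempted but unconfirmed' entry for the replayer.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        entry = {"call": fn.__name__, "args": args,
                 "kwargs": kwargs, "done": False}
        TRACE.append(entry)           # 1. record intent first
        result = fn(*args, **kwargs)  # 2. only then perform the effect
        entry["done"] = True          # 3. mark completion
        return result
    return wrapper

@traced
def write_file(path: str, data: str) -> int:
    # Stand-in for a real side-effect (filesystem write, network call...).
    return len(data)

write_file("/tmp/out.txt", "hello")
```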
How do you handle execution-time guarantees?
For example: when an MCP tool call touches the filesystem or network,
do you validate + log the side-effects before execution?
I’ve seen audits fail not at planning, but at the exact tool-call boundary.
One thing I’ve been bitten by with desktop agents is execution-time safety:
the plan is correct, but a single malformed path or OS call causes real damage.
Do you enforce any guardrails at the tool boundary
(e.g. path sandboxing, network allowlists, dry-run / replay)?
Phenomenal questions. Sandboxing would be a phenomenal addition. Allowlisting is currently possible, but it requires code changes, so a configuration-based approach is probably closer to what you're referring to?
The replay feature builds on the record feature, though I wouldn't call it a "guardrail".
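A configuration-driven allowlist at the tool boundary could look something like this sketch. It is purely illustrative (not the project's actual config format): a declarative policy for network hosts and filesystem roots, checked before any call executes:

```python
import os
from urllib.parse import urlparse

# Hypothetical declarative policy a user could ship as config
# instead of editing code.
POLICY = {
    "network_allowlist": ["api.example.com", "internal.example.com"],
    "fs_roots": ["/srv/app/data"],
}

def check_network(url: str) -> bool:
    """Allow only hosts explicitly named in the config."""
    host = urlparse(url).hostname or ""
    return host in POLICY["network_allowlist"]

def check_path(path: str) -> bool:
    """Allow only paths under the configured roots.

    normpath resolves '..' first, so traversal like
    '/srv/app/data/../../etc/passwd' is rejected.
    """
    real = os.path.normpath(os.path.join("/", path))
    return any(real == root or real.startswith(root + "/")
               for root in POLICY["fs_roots"])

check_network("https://api.example.com/v1")            # in the allowlist
check_path("/srv/app/data/../../etc/passwd")           # resolves outside the root
```

Keeping the policy as plain data (rather than code) is what makes it reviewable and swappable per deployment.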
I mostly worry about the gap between a correct plan and execution-time behavior — especially when tools touch the filesystem or OS APIs. Even a single malformed argument can have irreversible effects.
Totally agree these guardrails are non-trivial, but it’s great to see the project thinking in this direction.
FailCore is intentionally not an agent framework, planner, or sandbox.
It sits strictly at the execution boundary and focuses on two things:
1) blocking unsafe side effects before they happen
2) recording enough execution trace to replay or audit failures later
The goal isn’t to make agents smarter, but to make their failures
observable, reproducible, and boring.
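In spirit, the execution-boundary contract above could be sketched like this (illustrative only, not FailCore's actual API): check policy, record the call, and only then execute, so every blocked or allowed side effect leaves an audit entry:

```python
class BlockedSideEffect(Exception):
    """Raised when a tool call violates policy, before any effect occurs."""

AUDIT_LOG: list[dict] = []

def guarded_call(tool_name, fn, policy, *args, **kwargs):
    """Check policy, record the call, then execute, in that order."""
    allowed = policy(tool_name, args, kwargs)
    AUDIT_LOG.append({"tool": tool_name, "args": args,
                      "kwargs": kwargs, "allowed": allowed})
    if not allowed:
        # 1) block unsafe side effects before they happen
        raise BlockedSideEffect(f"{tool_name} blocked by policy")
    # 2) the entry recorded above is enough to replay/audit later
    return fn(*args, **kwargs)

# Toy policy: reads are fine, deletes are not.
def no_deletes(name, args, kwargs):
    return name != "fs.delete"

guarded_call("fs.read", lambda p: f"<{p}>", no_deletes, "/tmp/x")
try:
    guarded_call("fs.delete", lambda p: None, no_deletes, "/tmp/x")
except BlockedSideEffect:
    pass  # the delete never ran, but it was still recorded
```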
If people are curious, the DESIGN.md goes deeper into why this is done
at the Python runtime level instead of kernel-level isolation (eBPF, VMs, etc.),
and what trade-offs that implies.
My motivation was to give PR/MR reviewers a very low-friction way to see a Helm chart change running.
The workflow is intentionally simple: install a GitHub App (or call a REST API in other workflows), open a PR/MR, and you get a live preview. That’s it.
There’s no ArgoCD setup, no Helmfile, no cluster provisioning, no DNS wiring to build or maintain. The goal was to make it trivial for reviewers to see “this PR running” — especially for public Helm charts where contributors and reviewers can’t realistically be expected to set up infrastructure just to demo a change.
If you already run ephemeral previews via ArgoCD or Helmfile, this probably isn’t adding much value. Those approaches work well once they’re in place. Chart Preview is aimed at the cases where teams want PR previews without having to design, build, and maintain that machinery themselves.
That makes sense — thanks for clarifying. Framing it as “zero infra ownership, just a reviewer convenience” really helps explain where this fits compared to ArgoCD-style previews.
If you use `culsans.Queue().async_q` as a direct replacement for `asyncio.Queue()`, then there is essentially no difference. The difference becomes apparent when you use additional features:
1. If checkpoints are enabled (by default when using Trio, or if you explicitly apply `aiologic.lowlevel.enable_checkpoints()`), then every call that is not explicitly non-blocking can be cancelled (even if no waiting is required). For comparison, `await queue.put(await queue.get())` for `queue = asyncio.Queue()` in an infinite loop will never yield back to the event loop (when 0 < size < maxsize is true), and as a result, no other asyncio tasks will ever continue their execution, and such a loop cannot be cancelled (see PEP 492).
2. Under multithreading (and the race conditions it brings), method calls are synchronized via an underlying lock (as in `queue.Queue`). This synchronization can briefly block the event loop, but it is rarely a bottleneck (Janus does the same). In general, it delays task cancellation and timeout handling while another thread still holds the lock. If you need extremely fast and scalable queues, `aiologic.SimpleQueue` may be the better option: it uses no form of internal state synchronization at all!
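The non-yielding behavior of plain `asyncio.Queue` described in point 1 can be demonstrated directly with the stdlib alone (no culsans/aiologic needed):

```python
import asyncio

async def main() -> bool:
    q: asyncio.Queue = asyncio.Queue(maxsize=2)
    q.put_nowait("item")  # keep 0 < size < maxsize throughout

    other_ran = False

    async def other():
        nonlocal other_ran
        other_ran = True

    task = asyncio.ensure_future(other())

    # Neither get() nor put() ever has to wait here, so neither call
    # contains a suspension point: this loop never yields control back
    # to the event loop, and `other` is starved the whole time.
    for _ in range(10_000):
        await q.put(await q.get())

    ran_during_loop = other_ran  # still False: the hot loop never yielded
    await task                   # first real suspension; now `other` runs
    return ran_during_loop

starved = asyncio.run(main())
```

With checkpoints enabled (as described above), every such call would contain a cancellation/scheduling point even when no waiting is required, so the starvation disappears.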
I am not sure I understand your question well enough. `asyncio.Queue` works exclusively under cooperative multitasking (it is not thread-safe), with all the simplifications that implies. Under the same conditions, Culsans queues operate much like any other queue capable of running as purely asynchronous with cancellation support (perhaps you are referring to starting new threads or new tasks as an implementation detail? Neither aiologic nor Culsans does any of that). As soon as preemptive multitasking is introduced, the behavior changes somewhat: `culsans.Queue` relies on sync-only synchronization of the internal state; `aiologic.Queue` uses async-aware synchronization without blocking the event loop (a lock is still needed because `heapq` functions are not thread-safe and priority queues require them, but the wait queues are combined, which achieves fairness and solves python/cpython#90968); and `aiologic.SimpleQueue` does not synchronize the internal state at all, thanks to effectively atomic operations.
I would also add that some non-trivial details are covered in the "Performance" section of the aiologic documentation [4]. What is described there for standard primitives also applies to Culsans queues (specifically, the mutex case). Other sections, such as "Why?", "Overview", and "Libraries", are relevant to Culsans as well, since aiologic is used under the hood.
Thanks, that clarifies it. The checkpoint-based cancellation and the sync-vs-async locking model differences were exactly what I was trying to understand.