Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.
Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
My workflow for adding a feature goes something like this:
1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.
2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.
3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.
3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.
4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.
4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.
5. Claude implements the feature.
5a. (Optionally) another instance reviews the implementation.
For complex changes, I'm quite disciplined about having each step carried out in a different session, so that all communications are done via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
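For concreteness, the checked-in artifacts in a setup like this might be laid out as follows (a hypothetical layout, not a prescription):

```
docs/
  requirements.md        # user-visible behaviour; the smallest artifact
  design/
    overview.md          # overall design
    component-a.md       # per-component design, including its test plan
    component-b.md
src/                     # code under test
tests/                   # unit tests kept in line with the test plans
```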
From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).
Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them and re-create them if they're ever needed again.
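Such a throwaway benchmark usually amounts to a few lines of timeit plus cProfile; the function being measured here is made up purely for illustration:

```python
import cProfile
import timeit

def parse_line(line: str) -> list[str]:
    # Hypothetical hot function under investigation.
    return [tok for tok in line.split(",") if tok]

def bench(n: int = 10_000) -> float:
    """Return seconds for n calls; handy as a quick before/after check."""
    return timeit.timeit(lambda: parse_line("a,b,,c,d"), number=n)

if __name__ == "__main__":
    print(f"{bench():.4f}s")
    # Then drill into where the time actually goes:
    cProfile.run("bench(1000)", sort="cumulative")
```

Since these are one-offs, a single file like this next to the code is usually all that's needed.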
I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
My mildly amusing anecdote is that, whenever Claude Code produces something particularly egregious, I often find it sufficient to reply with just "wtf?" for it to present a much more sensible version of the code (which often needs further refinement, but that's another story...)
But we don't evolve IL or assembly code as the system evolves. We regenerate it from scratch every time.
It is therefore not important whether some intermediate version of that low-level code was completely impossible to understand.
It is not so with LLM-written high-level code. More often than not, it does need to be understood and maintained by someone or something.
These days, I mainly focus on two things in LLM code reviews:
1. Making sure unit tests have good coverage of expected behaviours.
2. Making sure the model is making sound architectural decisions, to avoid accumulating tech debt that'll need to be paid back later. It's very hard to check this with unit tests.
We get stuck reviewing the output assembly when it's broken, and that does happen from time to time. The reason it doesn't happen often is that generation of assembly follows strict rules, which people have tried their best to test. That's not the behaviour we're going to get out of an LLM.
Yes, prompts aren't analogous to higher-level code; they're analogous to code-generation wizards or something like that, which were always rightly viewed with suspicion.
Part of it is observability bias: longer, more widespread outages are more likely to draw significant attention. This doesn't mean that there aren't also shorter, smaller-scope outages; it's just that we're much less likely to know about them.
For example, if there's a problem that gets caught at the 1% stage of a staged rollout, we're probably not going to find ourselves discussing it on HN.
And how is it going, in terms of finding those limits? It would be very interesting to hear about areas where the actual experience turned out to be wildly different from your expectations, in either direction.
This looks cool, but what I'd really like is a self-hosted version that I could use to auto-subtitle videos I already have locally. This would help my language learning a great deal.
If any of you have already figured out a tool/workflow for this, I'd love to learn from your experience.
This thread prompted me to look into this. It seems that all I need is a thin wrapper around whisper-ctranslate2. So I wrote one and am playing with it right now.
I'm finding language auto-detection to be a bit wonky (for example, it repeatedly identified Ladykracher audio as English instead of German). I ended up having to force a language instead. The only show in my library where this approach doesn't work is Parlement[1], but I can live with that.
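The wrapper really is thin; a minimal sketch might look like this (the CLI flags are whisper-ctranslate2's openai-whisper-compatible options, but treat the exact set as an assumption and check `--help` on your install):

```python
import subprocess
from pathlib import Path
from typing import Optional

def build_cmd(video: Path, language: Optional[str] = None,
              model: str = "small") -> list[str]:
    """Build a whisper-ctranslate2 invocation that writes an .srt next to the video.

    Passing an explicit language skips the (sometimes wonky) auto-detection.
    """
    cmd = [
        "whisper-ctranslate2", str(video),
        "--model", model,
        "--output_format", "srt",
        "--output_dir", str(video.parent),
    ]
    if language:  # e.g. "de" to force German instead of auto-detect
        cmd += ["--language", language]
    return cmd

def subtitle(video: Path, language: Optional[str] = None) -> None:
    subprocess.run(build_cmd(video, language), check=True)
```

Forcing the language (e.g. `subtitle(Path("ep1.mkv"), "de")`) is what worked around the misdetection for me.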
On the whole this is looking quite promising. Thanks for the idea.
Another potential factor at play is the accuracy of delivery. It is generally easier to accurately deliver one quick dose vs daily doses over multiple weeks (due to patient positioning errors, the patient losing weight, soft tissues moving around etc).
Seems to be a Nintendo term for what other companies might have called a "VDP" (Video Display Processor) or a "VIC" (Video Interface Chip).
Brings back warm memories of the Yamaha V9938 VDP used in MSX2 machines.