Good post. I recently built a choose-your-own-adventure style educational game at work for a hackathon.
Prompting an LLM to generate and run a game like this gave immediately impressive results: 10 mins after starting we had something that looked great. The problem was that the game sucked. It always ended after 3-4 rounds of input regardless of what the player did. It constantly gave the game away, because it had all the knowledge in its context, and it just didn't have the right flow at all.
What we ended up with at the end of the ~2 days was a whole bunch of Python orchestrating 11 different prompts: no cases where the user could directly interact with the LLM, only one case where we re-used context across multiple queries, and a bunch of (basic) RAG to hide game state from the LLM until the user's actions caused it to be revealed.
LLMs are best used as small cogs in a bigger machine. Very capable, nearly magic cogs, but orchestrated by a lot of regular engineering work.
> Prompting an LLM to generate and run a game like this gave immediate impressive results, 10 mins after starting we had something that looked great. The problem was that the game sucked. It always went 3-4 rounds of input regardless. It constantly gave the game away because it had all the knowledge in the context, and it just didn't have the right flow at all.
I'm confused. Did you ask the LLM to write the game in code? Or did the LLM run the entire game via inference?
Why do you expect that the LLM can generate the entire game from a few prompts and have it work exactly the way you want? Did your prompt specify the exact conditions for the game?
> Or did the LLM run the entire game via inference?
This, this was our 10 minute prototype, with a prompt along the lines of "You're running a CYOA game about this scenario...".
> Why do you expect that the LLM can generate the entire game with a few prompts
I did not expect it to work, and indeed it didn't. However, why it didn't work wasn't obvious to the whole group, and much of the iteration process in the hackathon was breaking things down into smaller components so that we could retain more control over the gameplay.
One surprising thing I hinted at there was using RAG not for its ability to expose more info to the model than can fit in context, but rather for its ability to hide info from the model until it's "discovered" in some way. I hadn't considered that before and it was fun to figure out.
> using RAG not for its ability to expose more info to the model than can fit in context, but rather for its ability to hide info from the model until it's "discovered" in some way
Yeah sure. The problem we had was that we had some "facts" to base the game on, but when the LLM generated multiple-choice choose-your-own-adventure style options, they would end up being leading questions towards the facts. I.e., the LLM knows what's behind the door, so an option might have been "check for the thing behind the door", and now the user knows it's there, because why else would it have been offered?
Instead we put all the facts in a RAG database. Now when we ask the LLM to generate options, it does so without knowing the actual answers, so they can't really be leading questions. We then take the user input, use RAG to fetch relevant facts, and "reveal" those facts to the LLM in subsequent prompts.
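A toy sketch of that flow, assuming a deliberately basic keyword-overlap retriever (the actual retrieval method and all names here are hypothetical, not what we built): the option-generating prompt never sees the facts, and a fact only enters a prompt after the player's action retrieves it.

```python
# Sketch of "RAG as information hiding": facts live outside the model's
# context and are revealed only when the player's action retrieves them.

HIDDEN_FACTS = [
    "the cellar door is unlocked",
    "the butler has a motive",
    "the letter is forged",
]

# Toy retrieval: keyword overlap, ignoring a few stopwords. A real system
# would use embeddings, but the hiding principle is the same.
STOPWORDS = {"the", "a", "is", "has"}

def retrieve_facts(player_action: str, facts: list[str]) -> list[str]:
    action_words = set(player_action.lower().split()) - STOPWORDS
    return [f for f in facts if action_words & (set(f.lower().split()) - STOPWORDS)]

def build_option_prompt(scene: str) -> str:
    # Note: no facts included, so the model can't write leading options.
    return f"Scene: {scene}\nOffer the player three possible actions."

def build_narration_prompt(scene: str, action: str, revealed: list[str]) -> str:
    # Only facts the player's action uncovered are "revealed" to the model.
    facts = "\n".join(revealed) or "(nothing new discovered)"
    return (f"Scene: {scene}\nPlayer action: {action}\n"
            f"Newly revealed facts:\n{facts}\nNarrate the outcome.")

revealed = retrieve_facts("search the cellar door", HIDDEN_FACTS)
# Only the cellar-door fact matches; the butler and the letter stay hidden.
```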
Honestly, we still didn't nail the gameplay or anything; it was pretty janky. But it was 2 days, a bunch of learning, and probably only 300 lines of Python in the end, so I don't want to overstate what we did. However, this one detail stuck with me.
LLMs work much better on narrow tasks. They get more lost the more information you introduce. Models are now introducing reasoning, which tries to address this problem, and some models are getting really good at it, like o3 or reasoner.com. I have access to both, and it looks like we will soon have models that become more accurate as you introduce more complexity, which would be a huge breakthrough in AI.
I've run numerous interactive text adventures through ChatGPT as well, and while it's great at coming up with scenarios and taking the story in surprising directions, it sucks at maintaining a coherent narrative. The stories are fraught with continuity errors. What time of day it is seems to be decided at random, and it frequently forgets things I did or items I picked up previously that are important. It also needs to be constantly reminded of rules that I gave it in the initial prompt. Basically, the stuff that the article refers to as "maintaining state."
I've become wary of trusting it with any task that takes more than 5-10 prompts to achieve. The more I need to prompt it, the more frequently it hallucinates.
> What we ended up with at the end of the ~2 days was a whole bunch of Python orchestrating 11 different prompts, no cases where the user could directly interact with the LLM, only one case where we re-used context across multiple queries, and a bunch of (basic) RAG to hide game state from the LLM until the user caused it to be revealed through their actions.
Super cool! I'm the author of the article. Send me an email if you ever just wanna chat about this on a call.