I started off using gpt-oss-120b on cpu. It uses about 60-65gb of memory or so but my workstation has 128gb of ram. If I had less ram, I would start off with the gpt-oss-20b model and go from there. Look for MoE models as they are more efficient to run.
My old threadripper pro was seeing about 15tps, which was quite acceptable for the background tasks I was running.
Pretty straightforward. Sources dump into a queue throughout the day, regex filters the obvious junk ("lol", "thanks", bot messages never hit the LLM), then everything gets batched overnight through Anthropic's Batch API for classification. Feedback gets clustered against existing pain points or creates new ones.
Most of the cost savings came from not sending stuff to the LLM that didn't need to go there, plus the batch API is half the price of real-time calls.