I love local-first. I'm finding that a 120B MoE hits the sweet spot for locally hosted models. Right now that takes a $2K Strix Halo, a $4K GB10 machine, or a $5K Mac Pro. Two years from now I think hardware will take us back to the $2K-ish range with good performance.
I love my dual-GPU setup (2x AMD Radeon R9700, 64GB VRAM), but it uses 5x the electricity of my GX10 (GB10 chip inside), and since layers are landing in system memory, my TPS is half that of the GX10.
Now, a dense model like Devstral 2 24B slaps on the dual-GPU setup. I just haven’t gotten as much out of that as I have out of the 120B MoEs.
Alternative headline: household spyware cash machine forced to pay $20 for being bad.
If you want to punish Meta then you have to punish the wonder boy who runs it. Not even shareholders can fight off the guy spending $80B on the metaverse.
Sadly, I don't think it's enough to make Meta change, because they have no business model if they are forced to be serious about online safety. That's probably also why they are pushing so hard for age verification: make safety someone else's problem.
Most people are using something in the llama.cpp family for inference. llama-server is my go-to. The Unsloth guides describe how to configure inference for your model of choice.
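For anyone who hasn't tried it: once llama-server is running it exposes an OpenAI-compatible HTTP API, so talking to your local model is a few lines. A minimal sketch, assuming the server is already up on the default port 8080 with whatever GGUF model you launched it with (the model name and prompt here are placeholders, not anything specific):

    # Query a local llama-server instance via its OpenAI-compatible
    # chat endpoint. Assumes llama-server is already running on the
    # default port 8080 (e.g. launched per the Unsloth guides).
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            # Placeholder name; llama-server serves whichever model
            # it was launched with, regardless of this field.
            "model": "local-model",
            "messages": [{"role": "user", "content": "Hello!"}],
            "temperature": 0.7,
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])

Because the endpoint is OpenAI-compatible, any client that can point at a custom base URL works the same way.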
I don’t know why this keeps coming up. Has this been a big deal for everyone else? Like, OK, it's a usability improvement, but the number of times I have read an article about this is silly.