Hacker News

I read the opposite: you don't have to be locked into Ollama's registry if you don't want to.

Could you share a bit more about what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of parameter fiddling to get good performance.



Recently llama.cpp made a few common parameters the default (-ngl 999, -fa on), so it got simpler: --model, --ctx-size, and --jinja generally do it to start.
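For a concrete picture, a minimal start looks roughly like this (the model path is a placeholder, and exact defaults depend on your llama.cpp build):

```shell
# Sketch of a minimal llama-server start. With recent builds, full GPU
# offload (-ngl 999) and flash attention (-fa on) are already defaults,
# so only model, context size, and chat-template handling need setting.
llama-server \
  --model ./my-model.gguf \
  --ctx-size 8192 \
  --jinja
```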

We end up fiddling with other parameters because they can give better performance for a particular setup, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads MoE expert tensors to the CPU while filling all available VRAM; it can give a 50% boost on models like gpt-oss-120b.
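A sketch of that setup (model filename and layer count are assumptions; the right value of --n-cpu-moe depends on how much VRAM you have):

```shell
# Keep the MoE expert tensors of the first N layers on the CPU while the
# rest of the model (attention, shared weights, KV cache) fills the GPU.
llama-server \
  --model ./gpt-oss-120b-mxfp4.gguf \
  --ctx-size 16384 \
  --jinja \
  --n-cpu-moe 20
# Tune the N in --n-cpu-moe downward until VRAM is nearly full: lower N
# means more experts on the GPU and usually more speed, until you OOM.
```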

Once you've tasted this, going without it is a no-go. Meanwhile, on Ollama there's an open issue asking for this: https://github.com/ollama/ollama/issues/11772

Finally, llama-swap separately provides the auto-loading/unloading feature for multiple models, each backed by its own llama-server command.
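llama-swap is driven by a small YAML config that maps model names to the command that serves them; a sketch under assumed paths and timeouts (the model name, file, and ttl here are illustrative):

```yaml
# llama-swap config sketch: one entry per model. llama-swap substitutes
# ${PORT} and starts/stops the command as requests come and go.
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT}
        --model ./gpt-oss-120b-mxfp4.gguf
        --ctx-size 16384 --jinja --n-cpu-moe 20
    ttl: 300   # unload after 5 minutes of inactivity
```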



