I have tried a lot of local models. I have 656GB of them on my computer so I hav...

anon373839 · 2025-03-12T09:07:10 1741770430

From the limited testing I've done, Gemma 3 27B appears to be an incredibly strong model. But I'm not seeing the same performance in Ollama as I'm seeing on aistudio.google.com. So, I'd recommend trying it from the source before you draw any conclusions.

One of the downsides of open models is that there are a gazillion little parameters at inference time (sampling strategy, prompt template, etc.) that can easily impair a model's performance. It takes some time for the community to iron out the wrinkles.

moffkalast · 2025-03-12T09:17:01 1741771021

At the end of the day it doesn't matter how good it its, it has no system prompt which means no steerability, a sliding window for incredibly slow inference compared to similar sized models because it's too niche and most inference systems have high overhead implementations of it, and Google's psychotic instruct tuning that made Gemma 2 an inconsistent and unreliable glass cannon.

I mean hell, even Mistral added system prompts in their last release, Google are the only ones that don't seem to bother with it by now.

hnfong · 2025-03-12T10:41:39 1741776099

If you actually looked at gemma-3 you’ll see that it does support system prompts.

I’ve never seen a case where putting the system prompt in the user prompt would lead to significantly different outcomes though. Would like to see some examples.

(edit: my bad. i stand corrected. it seems the code just prepends the system prompts to the first user prompt.)

CoryLR · 2025-03-12T11:19:33 1741778373

This has been my experience as well. I don’t use system prompts anymore. Curious if there’s a good reason to start.

moffkalast · 2025-03-12T11:30:56 1741779056

It matters as a common standard for model integration. System messages aren't removed when regular context messages start to get cut, the model pays more attention to tools and other directives defined there. Like situational context, language it should use, type of responses, etc. OAI/Anthropic give their models system prompts a mile long to tune their behaviour to a T.

There's also the ideological camp with Hartford and Nous and the rest where models are supposed to be trained as generally as possible, with the system prompt being strictly followed to adapt it to specific use cases.

tobacco-sausage · 2025-03-12T11:57:01 1741780621

Direct prompt injection attacks can be somewhat mitigated by using system prompts (though not completely) if that helps.

moffkalast · 2025-03-12T11:18:43 1741778323

I've read the Gemma 3 technical report, it doesn't mention anything about it in the format section. Did they forget to include that? Where did you find the source that claims otherwise?

staticman2 · 2025-03-12T12:11:57 1741781517

Google AI Studio offers no system prompt option for Gemma 3 which also suggests this model doesn't support it...

moffkalast · 2025-03-12T12:19:07 1741781947

The official ggufs have this in them as an example, I'm even more confused now:

<start_of_turn>user

You are a helpful assistant

Hello<end_of_turn>

<start_of_turn>model

Hi there<end_of_turn>

<start_of_turn>user

How are you?<end_of_turn>

<start_of_turn>model

hnfong · 2025-03-12T12:39:23 1741783163

I was confused by this too, but now I think they’re just suggesting to prepend the system prompt to the first user prompt.

I think the ggufs took the chat template defined in tokenizer_config.json , which basically does that.

hnfong · 2025-03-12T12:18:55 1741781935

My bad, original comment updated.

sieve · 2025-03-12T10:34:55 1741775695

The Gemma 2 Instruct models are quite good (9 & 27B) for writing. The 27B is good at following instructions. I also like DeepSeek R1 Distill Llama 70B.

The Gemma 3 Instruct 4B model that was released today matches the output of the larger models for some of the stuff I am trying.

Recently, I compared 13 different online and local LLMs in a test where they tried to recreate Saki's "The Open Window" from a prompt.[1] Claude wins hands down IMO, but the other models are not bad.

[1] Variations on a Theme of Saki (https://gist.github.com/s-i-e-v-e/b4d696bfb08488aeb893cce3a4...)

mythz · 2025-03-12T08:54:57 1741769697

Concur with Gemma2 being underwhelming, I dismissed it pretty quickly but gemma3:27b is looking pretty good atm.

BTW mistral-small:24b is also worth mentioning (IMO best local model) and phi4:14b is also pretty strong for its size.

mistral-small was my previous local goto model, testing now to see if gemma3 can replace it.

InsideOutSanta · 2025-03-12T09:05:40 1741770340

One more vote for Mistral for local models. The 7B model is extremely fast and still good enough for many prompts.

zacksiri · 2025-03-12T10:24:09 1741775049

You should try Mistral Small 24b it’s been my daily companion for a while and have continued to impress me daily. I’ve heard good things about QwQ 32b that just came out too.

jrm4 · 2025-03-12T10:40:35 1741776035

Nice, I think you're nailing the important thing -- which is "what exactly are they good FOR?"

I see a lot of talk about good and not good here, but (and a question for everyone) what are people using the non-local big boys for that the locals CAN'T do? I mean, IRL tasks?

blooalien · 2025-03-12T09:22:14 1741771334

I have had nothing but good results using the Qwen2.5 and Hermes3 models. The response times and token generation speeds have been pretty fantastic compared against other models I've tried, too.

usef- · 2025-03-12T11:33:33 1741779213

To clarify, are you basing this comment on experience with previous Gemma releases, or the one from today?

mupuff1234 · 2025-03-12T10:05:18 1741773918

Ok, but have you tried Gemma3?

rpastuszak · 2025-03-12T10:21:28 1741774888

Thanks for the overview.

> Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks > OpenThinker-7B - Good and fast reasoning

Any chance you could be more specific, ie give an example of a concrete coding task or reasoning problem you used them for?

miroljub · 2025-03-12T10:49:30 1741776570

Qwen2.5-Coder:32B is the best open source coding model. I use it daily, and I don't notice that it lags much behind Claude 3.5.

I would be actually happy to see R1 distilled version, it may make it perform better with the less resource usage.

rpastuszak · 2025-03-12T10:54:04 1741776844

Thanks! Do you use it with Aider/terminal/a web GUI?

miroljub · 2025-03-12T11:56:01 1741780561

I use it with Emacs ellama, Continue.dev plugin and as a Web Chat.

thom · 2025-03-12T10:50:26 1741776626

Could you talk a little more about your D&D usage? This has turned into one of my primary use cases for ChatGPT, cooking up encounters or NPCs with a certain flavour if I don't have time to think something up myself. I've also been working on hooking up to the D&D Beyond API so you can get everything into homebrew monsters and encounters.

archerx · 2025-03-12T13:47:20 1741787240

I noticed the prompt makes a big difference in the experience you get. I have a simple game prompt.

The first prompt I tested out I got from this video; https://www.youtube.com/watch?v=0Cq-LuJnaRg

It was ok and produces shallow adventures.

The second one I tried was from this site; https://www.rpgprompts.com/post/dungeons-dragons-chatgpt-pro...

a bit better and is easier to modify but still shallow.

The best one I have tried so far is this one from reddit; https://old.reddit.com/r/ChatGPT/comments/zoiqro/most_improv...

It is a super long prompt and I had to edit it a lot, and manually extract the data from some of the links but it has been the best experience by far. I even became "friends" with an NPC who accompanied me on a quest and it was a lot of fun and I was fully engaged.

The model of choice matters but even llama 1B and 2B can handle some stories.

camel_Snake · 2025-03-12T20:45:04 1741812304

May want to check out the Wayfarer models on: https://huggingface.co/LatitudeGames

afaik they are more for roleplaying a D&D style adventure than planning it, but I've heard good things.

DeepSeaTortoise · 2025-03-12T09:08:13 1741770493

TBH, I REALLY like the tiny models. Like smollm2.

Also lobotomized LLMs ("abliterated") can be a lot of fun.

andai · 2025-03-12T10:18:48 1741774728

I think you mean un-lobotomize, and apparently it can be done without retraining? Wild!

https://huggingface.co/blog/mlabonne/abliteration

memhole · 2025-03-12T14:44:24 1741790664

Thanks for clarifying that. I can understand their take. But, I also think of the models without abliteration as lobotomized

pduggishetti · 2025-03-12T09:06:56 1741770416

Recently phi4 has been very good too!

m00dy · 2025-03-12T11:19:55 1741778395

sshht, don't make it a public debate :P)

memhole · 2025-03-12T09:59:00 1741773540

Do you mostly stick with smaller models? I’m pretty surprised at how good the smaller models can be at times now. A year ago they were nearly useless. I kind of like too the hallucinations are more obvious sometimes. Or at least it seems like they are.

archerx · 2025-03-12T10:26:03 1741775163

I like the smaller models because they are faster. I even got a Llama 3 1B model running on TinkerBoard 2S and it was fun to play around with and not too slow. The smaller models are still good at summarizing and other basic tasks. For coding they start showing their limits but still work great for trying to figure out issues in small bits of code.

The real issue with local models is managing context. smaller models let you have a longer context without losing performance but bigger models are smarter but if you want to keep it fast I have to reduce the context length.

Also all of the models have their own "personalities" and they still manifest in the finetunes.

memhole · 2025-03-12T14:41:51 1741790511

Yeah, that’s why I like the smaller models too. Big context windows and intelligent enough most of the time. They don’t follow instructions as well as the larger models ime. But then on the flip side the reasoning models struggle to deviate. I gave deepseek an existential crisis by accident the other day lol.

Agreed on personalities. Phi, I think because of the curated training data comes across as very dry.

jeswin · 2025-03-12T10:13:50 1741774430

I still find them useless. What do you use them for?

sebastiansm · 2025-03-12T11:23:50 1741778630

Anyone can recommend a small model specific for translation? english to spanish mostly.

pzo · 2025-03-12T11:45:04 1741779904

Depends what you mean small 4B? 7B? You can try qwen2.5 3B or 7B though 3B version is on no commercial license. Phi4-mini also should be good. Tested only on polish/english pairs should be good for spanish too. Smaller models like 1.5B were kind of useless for me.

archerx · 2025-03-12T13:33:14 1741786394

I haven't done deep testing on it but Tower-Babel_Babel-9B should be what you are looking for.

karma_fountain · 2025-03-12T11:19:17 1741778357

Ah, OpenThinker-7B. A diverse variety of LLM from the OpenThoughts team. Light and airy, suitable for everyday usage and not too heavy on the CPU. A new world LLM for the discerning user.

flir · 2025-03-12T11:58:21 1741780701

I find New World LLMs kinda... well, they don't have the terroir, ya know?

panki27 · 2025-03-12T11:56:06 1741780566

I've had really good results with Qwen2.5-7b-Instruct.

Do you have any recommendations for a "general AI assistant" model, not focused on a specific task, but more a jack-of-all-trades?

archerx · 2025-03-12T13:32:23 1741786343

If I could only use one model from now on it would either be the deepSeek R1 Qwen or Llama distill.

xnx · 2025-03-12T13:33:20 1741786400

Let us know when you've evaluated Gemma 3. Just as with the switch between ChatGPT 3.5 and ChatGPT 4, old versions don't tell you much about the current version.

tomp · 2025-03-12T09:32:19 1741771939

Any below 7B you'd recommend?

IME Qwen2.5-3B-Instruct (or even 1.5B) have been quite remarkable, but I haven't done that heavy testing.

archerx · 2025-03-12T10:33:12 1741775592

Try;

- EXAONE-3.5-2.4B-Instruct - Llama-3.2-3B-Instruct-uncensored - qwq-lcot-3b-instruct - qwen2.5-3b-instruct

These have been very interesting tiny models, they can do text processing task and can handle story telling. The Llama-3.2 is way to sensitive to random stuff so get the uncensored or abliterated versions

_1 · 2025-03-12T12:09:40 1741781380

How are you grading these? Are you going on feeling, or do you have a formalized benchmarking process?

archerx · 2025-03-12T13:36:06 1741786566

From just using them a lot and getting the results that I want without going "ugh!".

dudefeliciano · 2025-03-12T11:31:51 1741779111

what hardware are you using those on? Is it still prohibitively expensive to self-host a model that gives decent outputs (sorry my last experience has been underwhelming with llama a while back)

sliken · 2025-03-13T01:07:58 1741828078

I'm tinkering with gemma 3 27B on a last gen 12 core ryzen. I get 5 tokens/sec.

archerx · 2025-03-12T13:38:10 1741786690

I have an AMD 6700 XT card with 12gb of VRAM and a 24 core cpu with 48gigs of ram. This is the bare minimum,

michaelbuckbee · 2025-03-12T11:41:14 1741779674

What's the driving reason for local models? Cost? Censorship?

laborcontract · 2025-03-12T11:46:37 1741779997

PII is the driving force for me. I like to have local models manage my browser tabs, reply to emails, and go through personal documents. I don't trust LLM providers not to retain my data.

dannyw · 2025-03-12T11:42:37 1741779757

Privacy is another big reason. I like to store my files locally with a backup, not on Dropbox or whatever.