coder543's comments on Hacker News

The description specifically says:

"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."


From the page:

> Import from anywhere. Start from a text prompt, upload images and documents (DOCX, PPTX, XLSX), or point Claude at your codebase. You can also use the web capture tool to grab elements directly from your website so prototypes look like the real product.


Thank you, I should RTFA next time.

Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.


Qwen models commonly get accused of benchmaxxing though. Just something to keep in mind when weighing the standard benchmarks.

Every model release gets accused of that, including the flagship models.

Less so for Gemma-4, because it falls behind Qwen on benchmarks. Contamination tests also point that way: https://x.com/bnjmn_marie/status/2041540879165403527

No… seriously. Every model release is accused. Including Opus, GPT-5.4, whatever. And yes, including smaller models that are not the top in every benchmark.

My own experiences with Gemma 4 have been quite mediocre: https://www.reddit.com/r/LocalLLaMA/comments/1sn3izh/comment...

I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.

Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.

And the lead author on the paper replied to that tweet to say that the scores would need to be greater than 80 to show actual contamination: https://x.com/MiZawalski/status/2043990236317851944?s=20


Not true. With a MoE, you can offload quite a bit of the model to the CPU without losing much performance. 16GB should be fine to run the 4-bit (or larger) quant at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
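As a sketch (the model filename and the layer count here are placeholders; --n-cpu-moe keeps the expert FFN weights of the first N layers on the CPU while the attention/dense weights stay on the GPU):

```shell
# Push expert weights for the first 20 layers to the CPU to fit in VRAM;
# tune the number up until the model loads, down for more speed.
llama-server -m ./model-Q4_K_M.gguf --n-cpu-moe 20 -ngl 99 -c 32768
```

Because only a few experts are active per token, keeping the hot shared weights on the GPU and streaming experts from system RAM degrades throughput far less than offloading dense layers would.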

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

Running llama-server (part of llama.cpp) starts an HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.
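A minimal sketch of that second option, assuming llama-server is listening on its default port 8080 (the model name is arbitrary; llama-server serves whatever model it was started with):

```python
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8080/v1", model="local"):
    """Build an OpenAI-style chat-completions request for a local server."""
    payload = {
        "model": model,  # llama-server ignores this and uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt, **kw):
    """Send the request and pull the assistant's reply out of the response."""
    with urllib.request.urlopen(build_request(prompt, **kw)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any client that speaks this wire format (coding harnesses, chat UIs, the official openai package with a custom base_url) works the same way.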


That is an extremely strange article, in my opinion. They test Gemma 4 31B, but compare it against Qwen3 32B, DeepSeek R1, and Kimi K2, all outdated models whose replacements were released long before Gemma 4. Qwen3.5 27B would have done far better on these tests than Qwen3 32B, and the same goes for DeepSeek V3.2 and Kimi K2.5. Not to mention the obvious absence of GLM-5.1, which is the leading open weight model right now.

The article also glosses over the discovery phase, which seems very important. If it were as easy as they claim, the models should have been let loose so we could see whether they actually found these bugs, and how many false positives they marked as critical. Instead, they pointed the models directly at the flawed code.


Gemma 4 31B has wiped several of those models off the Pareto frontier now that it has pricing. Gemma 4 26B A4B has an Elo, but no pricing, so it still isn't on that chart. The Gemma 4 E2B/E4B models still aren't on the arena at all, but based on how well they've performed in general, I expect them to move the Pareto frontier as well if they're ever added.


If you search the model card[0], there is a section titled "Code for processing Audio", which you can probably use to test things out. But, the model card makes the audio support seem disappointing:

> Audio supports a maximum length of 30 seconds.

[0]: https://huggingface.co/google/gemma-4-26B-A4B-it#getting-sta...


The E2B and E4B models support 128k context, not 256k, and even with the 128k... it could take a long time to process that much context on most phones, even with the processor running full tilt. It's hard to say without benchmarks, but 128k supported isn't the same as 128k practical. It will be interesting to see.


That Pareto plot doesn't seem to include the Gemma 4 models anywhere (not just not at the frontier), likely because pricing wasn't available when the chart was generated. At least, I can't find the Gemma 4 models there. So, not particularly relevant until it is updated for the models released today.


There are issues with the chat template right now[0], so tool calling does not work reliably[1].

Every time people try to rush to judge open models on launch day... it never goes well. There are ~always bugs on launch day.

[0]: https://github.com/ggml-org/llama.cpp/pull/21326

[1]: https://github.com/ggml-org/llama.cpp/issues/21316


What causes these? Given how simple the LLM interface is (just completion), why don't teams make a simple, standardized template available with their model release so the inference engine can just read it and work properly? Can someone explain the difficulty with that?


The model does have the format specified, but there is no _one_ standard. For this model it’s defined in the tokenizer_config.json[0]. As for llama.cpp, they seem to be using a more type-safe approach to reading the arguments.

[0] https://huggingface.co/google/gemma-4-31B-it/blob/main/token...
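To illustrate the general idea (this is a toy template with made-up markers, not Gemma's actual one): a chat template is just a function from a message list to the flat prompt string the model was trained on, and the inference engine has to reproduce it exactly, whitespace and special tokens included.

```python
def render_chat(messages, add_generation_prompt=True):
    """Toy chat template: flatten OpenAI-style messages into one prompt
    string using illustrative <|start|>/<|end|> markers."""
    out = []
    for m in messages:
        out.append(f"<|start|>{m['role']}\n{m['content']}<|end|>\n")
    if add_generation_prompt:
        out.append("<|start|>assistant\n")  # cue the model to reply
    return "".join(out)

prompt = render_chat([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
])
```

Real templates are Jinja snippets shipped in tokenizer_config.json, and subtle mismatches in how an engine renders them (whitespace, special tokens, how tool calls are wrapped and parsed back out) are exactly the kind of thing that breaks tool calling even though the template is technically published.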


Hm, but surely there will be converters for such simple formats? I'm confused as to how there can be tool-calling bugs when the model already ships the template.


was just merged


It was just an example of a bug, not that it was the only bug. I’ve personally reported at least one other for Gemma 4 on llama.cpp already.

In a few days, I imagine that Gemma 4 support should be in better shape.

