
zeroxfe · 2026-04-02T13:15:03 1775135703

I've done this kind of thing many times with codex and sqlite, and it works very well. It's one prompt that looks something like this:

- inspect and understand the downloaded data in directory /path/..., then come up with an sqlite data model for doing detailed analytics and ingest everything into an sqlite db in data.sqlite, and document the model in model.md.

Then you can query the database adhoc pretty easily with codex prompts (and also generate PDF graphs as needed.)

I typically use the highest reasoning level for the initial prompt, and as I get deeper into the data, continuously improve on the model, indexes, etc., and just have codex handle any data migration.
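As a rough illustration of the ingest-then-query loop described above, here is a minimal sketch using Python's standard `sqlite3` module. The `events` table, its columns, and the sample rows are all hypothetical stand-ins for whatever model the agent would actually derive from the data, and an in-memory database is used instead of data.sqlite so the snippet has no side effects.

```python
import sqlite3

# Hypothetical schema: one "events" table standing in for the real model.
conn = sqlite3.connect(":memory:")  # the comment above uses data.sqlite
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, day TEXT NOT NULL, "
    "category TEXT NOT NULL, amount REAL NOT NULL)"
)

# Made-up sample rows, only to have something to aggregate.
rows = [
    ("2026-01-01", "a", 10.0),
    ("2026-01-01", "b", 5.0),
    ("2026-01-02", "a", 2.5),
]
conn.executemany(
    "INSERT INTO events (day, category, amount) VALUES (?, ?, ?)", rows
)

# An index added later, as the model gets refined.
conn.execute("CREATE INDEX idx_events_category ON events (category)")
conn.commit()

# An ad hoc analytics query of the kind a follow-up prompt might generate.
totals = conn.execute(
    "SELECT category, SUM(amount) FROM events "
    "GROUP BY category ORDER BY category"
).fetchall()
print(totals)
```

In practice the schema, indexes, and migrations would be whatever the agent proposes and documents in model.md; the point is only that everything ends up queryable with plain SQL.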

zeroxfe · 2026-03-23T15:54:30 1774281270

Expiries are a defence-in-depth measure that exists primarily for crypto hygiene, for example to protect against compromised keys. If the private key material is well protected, the risk is very low.

However, an org (particularly a .mil) not renewing its TLS certs screams of extreme incompetence (which is exactly what expiries are meant to protect you from.)

jp191919 · 2026-03-23T16:06:57 1774282017

>screams of extreme incompetence

Not unheard of with the military

cozzyd · 2026-03-23T17:02:10 1774285330

Precision lethality, not certificate renewality.

hoherd · 2026-03-23T17:57:40 1774288660

Let's not kid ourselves, the lethality isn't even that precise.

crote · 2026-03-23T20:57:05 1774299425

It is quite precise, it just isn't accurate.

On the one hand, they can do a perfect triple-tap. On the other hand, the perfect triple-tap hit a girls' school rather than a military base...

rectang · 2026-03-23T18:41:42 1774291302

"Why not neither?"

zeroxfe · 2026-03-15T22:29:46 1773613786

> Also MCP is very obviously dead, as any of us doing heavy agentic coding know.

As someone that does heavy agentic coding (using basically all the tools), this is so far from the truth. People claiming this have probably never worked in large enterprise environments where things like authentication, RBAC, rate limiting, abuse detection, centralized management/updates/ops, etc. are a huge part of the development and deployment workflow.

In these situations you can't just use skills and cli tools without a gigantic amount of retooling and increased operational and security complexity. MCP is really useful here, and allows centralized eng and ops teams to manage their services in a way that aligns with the organization's overall posture, policies, and infrastructure.

> Google is so far behind agentic cli coding. Gemini CLI is awful.

This part I totally agree with. It's really hard to express how bad it is (and it's really disappointing.)

bloppe · 2026-03-16T01:18:20 1773623900

> you can't just use skills and cli tools without a gigantic amount of retooling and increased operational and security complexity

You're describing MCP. After all, MCP is just reinventing the OpenAPI wheel. You can just have a self-documenting REST API using OpenAPI. Put the spec in your context and your model knows how to use it. You can have all the RBAC and rate limiting and auth you want. Heck, you could even build all that complexity into a CLI tool if you want. MCP the protocol doesn't actually enable anything. And implementing an MCP server is exactly as complex as using any other established protocol if you're using all those features anyway
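To make the "put the spec in your context" idea concrete, here is a small sketch that flattens an OpenAPI document into one line per operation, suitable for pasting into a prompt. The function name `summarize_openapi` and the sample spec are hypothetical; only the `paths` / method / `summary` layout comes from the OpenAPI format itself.

```python
# HTTP methods that may appear as keys in an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def summarize_openapi(spec: dict) -> str:
    """Render each operation as 'METHOD /path: summary', one per line."""
    lines = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            lines.append(f"{method.upper()} {path}: {op.get('summary', '')}")
    return "\n".join(lines)


# Hypothetical spec fragment, for illustration only.
spec = {
    "openapi": "3.0.0",
    "paths": {
        "/pets": {
            "get": {"summary": "List pets"},
            "post": {"summary": "Create a pet"},
        },
        "/pets/{id}": {"get": {"summary": "Get one pet"}},
    },
}
print(summarize_openapi(spec))
```

Auth, RBAC, and rate limiting would live in the API behind the spec, as the comment says; this only shows the self-description half.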

whattheheckheck · 2026-03-16T02:46:36 1773629196

The clients for mcp can drop a url and http in the mcp.json and get access to the application. Can the client do that for every rest api?

bloppe · 2026-03-16T02:52:13 1773629533

Ya, if you just use OpenAPI. That's why I'm saying MCP adds nothing. It's just another standard for documenting APIs. There are many that have been around for a long time and that are better integrated with existing ecosystems. There's also gRPC reflection. I'm sure there are others. LLMs can use them all equally effectively.

moritonal · 2026-03-15T23:36:33 1773617793

Given MCP is supposed to just be a standardised format for self-describing APIs, why are all the features you listed MCP related things? It sounds more like it's forced the enterprise to build such features which cli tooling didn't have?

rsalus · 2026-03-15T23:44:24 1773618264

mostly by virtue of being a common standard. MCP servers are primarily useful in a remote environment, where centralized management of cross-cutting concerns matters. also it's really useful for integrating existing distributed services, e.g., internal data lakes.

I think it's clear a self-describing CLI is optimal for local-first tooling and portability. I personally view remote MCP servers as complementary in the space.

tomnipotent · 2026-03-16T00:32:48 1773621168

MCP's can hide most things behind an API.

zeroxfe · 2026-03-07T02:32:26 1772850746

> TBH I would have just rendered a font glyph, or failing that, grabbed an image.

If an LLM did that, people would be all up in arms about it cheating. :-)

For all its flaws, we seem to hold LLMs up to an unreasonably high bar.

marginalia_nu · 2026-03-07T02:46:35 1772851595

That's the job description for a good programmer though. Question assumptions and requirements, and then find the simplest solution that does the job.

Just about anyone can eventually come up with a hideously convoluted HeraldicImageryEngineImplFactory<FleurDeLis>.

zeroxfe · 2026-02-24T19:15:20 1771960520

> At some point you need to treat people as adults, which includes letting them make very bad decisions if they insist on doing so.

The world does not consist of all rational actors, and this opens the door to all kinds of exploitation. The attacks today are very sophisticated, and I don't trust my 80-yr old dad to be able to detect them, nor many of my non-tech-savvy friends.

> any more than it would be acceptable for a bank to tell an alcoholic "we aren't going to let you withdraw your money because we know you're just spending it at the liquor store".

This is a false equivalence.

bigstrat2003 · 2026-02-24T19:49:20 1771962560

It's not a false equivalence at all. Both situations are taking away someone's control of something that they own, borne from a paternalistic desire to protect that person from themselves. If one is acceptable, the other should be. Conversely if one is unacceptable, the other should be unacceptable as well. Either paternalistic refusal to let people do as they wish is ok, or it isn't.

NewsaHackO · 2026-02-24T20:36:12 1771965372

Maybe not, but I think that overextending any idea like that in the opposite direction of whatever point you are trying to make at least devolves into a "slippery slope" argument. For instance, is your point that all security on phones that impede freedom of the user (for instance, HTTPS, forced password on initial startup, not allowing apps to access certain parts of the phone without user permissions, verifying boot image signatures) should be removed as well?

bigstrat2003 · 2026-02-24T20:48:34 1771966114

No, that's not my point at all. Measures such as that are a tool which is in the hands of the user. There is a default restriction which is good enough for most cases, but the user has the ability to open things up further if he needs. What Google is proposing takes control out of the user's hands and makes Google the sole arbiter of what is and is not allowed on the device.

NewsaHackO · 2026-02-24T21:18:23 1771967903

None of the measures I mentioned are changeable by the user, except possibly sideloading an HTTPS certificate. That's the only way any of those measures even work; if it wasn't set as invariants by the OS, they would be bypassable.

>There is a default restriction which is good enough for most cases, but the user has the ability to open things up further if he needs.

But this is what the other guy's point is. You are defining "good enough for most cases" in a way that he is not, then making the argument that what he says is equivalent to not allowing an alcoholic to buy beer. Why can you set what level is an acceptable amount of restriction, but he can't?

array_key_first · 2026-02-24T21:06:06 1771967166

But it's not a slippery slope, because it's not taking it to the next level. It's the same level, just a different thing.

h3lp · 2026-02-24T21:04:31 1771967071

The alcoholic knows the bad outcomes, and chooses to ignore them. The hapless Android user does not understand the negative consequences of sideloading. I think this makes for a substantial difference between those two.

bigbadfeline · 2026-02-25T01:16:39 1771982199

> The hapless Android user does not understand the negative consequences of sideloading.

Then make sideloading disabled by default but enable it when the users tap 7 times on whatever settings item. At that time, explain those "negative consequences" to them, explain them real good, don't spare anything and if they still hit "Yes, continue to enable sideloading" you do that immediately in order to avoid increasing their haplessness with other made-up excuses.

Simple.

kode-targz · 2026-02-25T19:38:49 1772048329

I don't see how people are against this. Especially tech-savvy people who browse HN. It really seems to me like everyone here who's on Google's side is just a bot in a botfarm somewhere. They can't possibly be real.

sheiyei · 2026-02-24T20:39:40 1771965580

Protecting from scams isn't protection from the victim themselves. That should be obvious from the fact that very intelligent and technologically literate people too can fall for phishing attacks. Tell me for example, how many people in your life know how a bank would ACTUALLY contact you about a suspected hijacking and what the process should look like? And how about any of the dozens of other cover stories used? Not to mention the situations where the scammers can use literally the same method of first contact as the real thing (eg. spoofed). ...And the fact that for example email clients do their best to help them by obscuring the email address and only showing the display name, because that's obviously a good idea.

bigstrat2003 · 2026-02-24T20:46:31 1771965991

> Protecting from scams isn't protection from the victim themselves.

That is where we differ. It is, ultimately, the victim of a scam who makes the choice of "yes, this person is trustworthy and I will do what they say". The only way to prevent that is to block the user from having the power to make that decision, which is to say protecting them from themselves.

joshuamorton · 2026-02-24T22:17:43 1771971463

But the proposal here, requiring developers to register their identities, doesn't actually impact consumers at all. They still have the ability to make the decision about whether or not to trust someone.

kode-targz · 2026-02-25T19:46:08 1772048768

Yes it does, especially when you remember the fact that developers are also consumers. But even if they (we) weren't, it would still impact consumers. I, an Android user who's completely ignorant when it comes to Android development or even mobile in general, would be heavily impacted by this. My custom YouTube clients would never be approved by Google. My (free) apps for watching anime and reading manga would never get approved by Google. And something that's approved today could stop being approved tomorrow. It's up to Google / Microsoft / Apple to decide after all, they're the ones in control of our devices. If they stop liking my open-source ad-free minesweeper game, then I can't play it anymore. I'll have to download their bloated proprietary version with ads and a subscription to keep playing.

joshuamorton · 2026-02-26T22:22:18 1772144538

> My custom youtube clients would never be approved by google. My (free) apps for watching anime and reading manga would never get approved by Google.

Google isn't approving apps though. A developer provides identity verification and a set of apps (apk names & keys) they are responsible for. There is no verification process or approval from google. The entire process as outlined in https://developer.android.com/developer-verification is that you prove you own signing keys for an apk name.

jrm4 · 2026-02-24T21:08:26 1771967306

None of these things requires "locking down phones." Every single thing you've mentioned can be done in a smarter way that doesn't involve "individuals aren't allowed to modify the devices they purchase."

sheiyei · 2026-02-25T20:42:19 1772052139

I'm very against the changes Google are doing, but I'm also against the claim that "people who get scammed are stupid and deserve it".

NewsaHackO · 2026-02-24T21:21:38 1771968098

You can't make a statement like that and provide no examples. What are some of your ideas for doing that?

zeroxfe · 2026-02-18T21:04:38 1771448678

> usual "true random number" bullshit

What's bullshit about it? This is how TRNGs in security enclaves work. They collect entropy from the environment, and use that to continuously reseed a PRNG, which generates bits.

If you're talking "true" in the philosophical sense, that doesn't exist -- the whole concept of randomness relies on an oracle.

wavemode · 2026-02-18T21:48:59 1771451339

What PRNGs lack compared to TRNGs is security (i.e. preventing someone from being able to use past values to predict future values). It's not that they somehow produce statistically invalid results (e.g. they generate 3s more often than 2s or something). Unless they're very poorly constructed.

refsys · 2026-02-18T22:11:26 1771452686

Maybe people have bad memories from linear congruential generators, these could go really bad (https://en.wikipedia.org/wiki/Marsaglia%27s_theorem)

adrian_b · 2026-02-19T13:01:32 1771506092

While LCGs are bad by themselves, they (together with Galois field counters, which have a large number of possible implementations, e.g. LFSRs, GFSRs, XorShift etc.) have some very desirable properties for a PRNG: known period, it is possible to make jumps through the sequence and it is possible to extract sub-sequences from it that are certain to not overlap, e.g. for a multithreaded simulation.

Because of this, the best non-cryptographic PRNGs are made from either a LCG or a GFC that ensures the properties mentioned above, together with a non-linear mixing function that scrambles the output, for much better statistical properties than a linear generator would have alone.

The good cryptographic RNGs have the same kind of structure, but where a one-way hash function or a block cipher function is used to scramble the output of a counter. The counter ensures in a simpler way the same properties as a LCG or GFC. A simple counter can be used here because the output mixing function is much more complex.
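The counter-plus-mixing structure described above can be sketched in a few lines. This is a toy illustration only, not a vetted CSPRNG: the choice of SHA-256, the `key || counter` layout, and the class and method names are all assumptions made for the example.

```python
import hashlib
import struct


class CounterRNG:
    """Toy counter-mode generator: output = SHA-256(key || counter)."""

    def __init__(self, key: bytes, counter: int = 0):
        self.key = key
        self.counter = counter

    def next_u64(self) -> int:
        # Mix the counter through a one-way hash, then advance the counter.
        block = hashlib.sha256(self.key + struct.pack(">Q", self.counter)).digest()
        self.counter += 1
        return struct.unpack(">Q", block[:8])[0]

    def jump(self, n: int) -> None:
        # Jumping ahead is just arithmetic on the counter.
        self.counter += n

    def stream(self, index: int, length: int) -> "CounterRNG":
        # Stream i covers counters [i*length, (i+1)*length), so streams
        # cannot overlap as long as each draws at most `length` values.
        return CounterRNG(self.key, index * length)


a = CounterRNG(b"demo-key")
b = CounterRNG(b"demo-key")
for _ in range(5):
    a.next_u64()
b.jump(5)
print(a.next_u64() == b.next_u64())
```

Because the state is a plain counter, the period, the jump-ahead, and the disjoint sub-sequences for multithreaded use all fall out of integer arithmetic, while the hash does the scrambling.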

wtallis · 2026-02-18T21:22:58 1771449778

I don't think hardware random number generators are bullshit, but it's easy to overstate their importance. Outside of cryptography, there aren't a whole lot of cases that truly require that much care in how random numbers are generated. For the kind of examples the article opens with (web page A/B testing, clinical trials, etc.) you'll never have sample sizes large enough to justify worrying about the difference between a half-decent PRNG and a "true" random number generator.

zeroxfe · 2026-02-19T12:15:32 1771503332

Yes, agreed. In many cases, the determinism is a feature, particularly being able to store the seed for reproducibility.
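A minimal sketch of the reproducibility point, using Python's standard `random` module; the `simulate` function and the seed value are hypothetical.

```python
import random


def simulate(seed: int, n: int = 5) -> list[float]:
    """Draw n values from a PRNG that is fully determined by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(n)]


seed = 12345  # store this alongside the results to replay the run later
first = simulate(seed)
second = simulate(seed)
print(first == second)
```

Storing the seed is enough to replay the exact sequence, which a hardware entropy source cannot offer.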

zeroxfe · 2026-02-05T23:45:02 1770335102

> it's a waste of time to steer them

It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.

There's a balance to strike between micro-management and no steering at all.

adw · 2026-02-06T04:29:50 1770352190

The prompt is decreasingly relevant. The verification environment you have is what actually matters.

freakynit · 2026-02-06T10:08:18 1770372498

I think this all comes down to information.

Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.

The same applies to verification: it's fundamentally an information problem.

You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.

adrianN · 2026-02-06T17:32:45 1770399165

Having prompts be information deficient is the whole point of LLMs. The only complete description of a typical programming problem is the final code or an equivalent formal specification.

freakynit · 2026-02-07T02:52:52 1770432772

Exactly the point. But, LLM's miss that human intuition part.

zeroxfe · 2026-02-04T15:59:59 1770220799

I've used both gVisor and microvms for this (at very large scales), and there are various tradeoffs between the two.

The huge gVisor drawback is that it _drastically_ slows down applications (despite startup time being faster.)

For agents, the startup time latency is less of an issue than the runtime cost, so microvms perform a lot better. If you're doing this in kube, then there's a bunch of other challenges to deal with if you want standard k8s features, but if you're just looking for isolated sandboxes for agents, microvms work really well.

zeroxfe · 2026-01-30T20:12:38 1769803958

It seems to work with OpenCode, but I can't tell exactly what's going on -- I was super impressed when OpenCode presented me with a UI to switch the view between different sub-agents. I don't know if OpenCode is aware of the capability, or the model is really good at telling the harness how to spawn sub-agents or execute parallel tool calls.

zeroxfe · 2026-01-30T19:18:58 1769800738

I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good as Opus.

rubslopes · 2026-01-31T00:51:10 1769820670

Also my experience. I've been going back and forth between Opus and Kimi for the last few days, and, at least for my CRUD webapps, I would say they are both on the same level.

armcat · 2026-01-30T19:27:42 1769801262

Out of curiosity, what kind of specs do you have (GPU / RAM)? I saw the requirements and it's beyond my budget so I am "stuck" with smaller Qwen coders.

zeroxfe · 2026-01-30T20:06:33 1769803593

I'm not running it locally (it's gigantic!) I'm using the API at https://platform.moonshot.ai

BeetleB · 2026-01-30T20:08:55 1769803735

Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).

https://z.ai/subscribe

InsideOutSanta · 2026-01-30T20:27:14 1769804834

There's no comparison. GLM 4.7 is fine and reasonably competent at writing code, but K2.5 is right up there with something like Sonnet 4.5. It's the first time I can use an open-source model and not immediately tell the difference between it and top-end models from Anthropic and OpenAI.

Alifatisk · 2026-01-31T10:43:22 1769856202

Kimi k2.5 is a beast, speaks very human like (k2 was also good at this) and completes whatever I throw at it. However, the glm quarterly coding plan is too good of a deal. The Christmas deal ends today, so I’d still suggest to stick to it. There will always come a better model.

cmrdporcupine · 2026-01-30T20:42:12 1769805732

From what people say, it's better than GLM 4.7 (and I guess DeepSeek 3.2)

But it's also like... 10x the price per output token on any of the providers I've looked at.

I don't feel it's 10x the value. It's still much cheaper than paying by the token for Sonnet or Opus, but if you have a subscribed plan from the Big 3 (OpenAI, Anthropic, Google) it's much better value for $$.

Comes down to ethical or openness reasons to use it I guess.

esafak · 2026-01-30T20:55:00 1769806500

Exactly. For the price it has to beat Claude and GPT, unless you have budget for both. I just let GLM solve whatever it can and reserve my Claude budget for the rest.

zeroxfe · 2026-01-30T20:15:34 1769804134

It's waaay better than GLM 4.7 (which was the open model I was using earlier)! Kimi was able to quickly and smoothly finish some very complex tasks that GLM completely choked at.

segmondy · 2026-01-30T21:37:12 1769809032

The old Kimi K2 is better than GLM4.7

akudha · 2026-01-30T20:43:05 1769805785

Is the Lite plan enough for your projects?

BeetleB · 2026-01-30T21:23:58 1769808238

Very much so. I'm using it for small personal stuff on my home PC. Nothing grand. Not having to worry about token usage has been great (previously was paying per API use).

I haven't stress tested it with anything large. Both at work and home, I don't give much free rein to the AI (e.g. I examine and approve all code changes).

Lite plan doesn't have vision, so you cannot copy/paste an image there. But I can always switch models when I need to.

HarHarVeryFunny · 2026-01-31T13:32:32 1769866352

It is possible to run locally though ... I saw a video of someone running one of the heavily quantized versions on a Mac Studio, and performing pretty well in terms of speed.

I'm guessing a 256GB Mac Studio, costing $5-6K, but that wouldn't be an outrageous amount to spend for a professional tool if the model capability justified it.

tucnak · 2026-01-31T13:50:57 1769867457

> It is possible to run locally though

> running one of the heavily quantized versions

There is night and day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model?" Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on 96 GB Mac studio (let alone on 128/256 GB ones) but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.

HarHarVeryFunny · 2026-02-02T17:07:01 1770052021

True, although the Mac Studio M3 Ultra does go up to 512GB (@ ~$10K) so models of this size are not too far out of reach (although I've no idea how useful Kimi K2.5 is compared to SOTA).

Kimi K2.5 is a MOE model with 384 "experts" and an active parameter count of only 32B, although that doesn't really help reduce RAM requirements since you'd be swapping out those active parameters on every token. I wonder if it would be viable to come up with an MOE variant where consecutive sequences of tokens got routed to individual experts, which would change the memory thrashing from per-token to per-token-sequence, perhaps making it tolerable?

jgalt212 · 2026-01-31T12:50:06 1769863806

What's the point of using an open source model if you're not self-hosting?

dimava · 2026-01-31T12:59:39 1769864379

Open source model costs are determined only by electricity usage, as anyone can rent a GPU and host them. Closed source models cost x10 more just because they can. A simple example is Claude Opus, which costs ~1/10 if not less in Claude Code, which doesn't have that price multiplier.

jgalt212 · 2026-01-31T13:50:07 1769867407

But Kimi seems so big that renting the necessary number of GPUs is a non trivial exercise.

pstuart · 2026-01-31T18:13:53 1769883233

Exactly! Electricity, hosting, and amortized cost of the GPUs would be the baseline costs.

oefrha · 2026-01-31T15:18:23 1769872703

Open source models can be hosted by provider, in particular plenty of educational institutions host open source models. You get to choose whatever provider you trust. For instance I used DeepSeek R1 a fair bit last year but never on deepseek.com or through its API.

elbear · 2026-01-31T12:56:54 1769864214

* It's cheaper than proprietary models

* Maybe you don't want to have your conversations used for training. The providers listed on OpenRouter mention whether they do that or not.

rc1 · 2026-01-30T21:06:30 1769807190

How long until this can be run on consumer grade hardware or a domestic electricity supply I wonder.

Anyone have a projection?

johndough · 2026-01-30T21:27:37 1769808457

You can run it on consumer grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so it will give you one token roughly every three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bit each. If you want to run it faster, you have to spend more money.
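The "one token roughly every three seconds" figure can be checked with a back-of-the-envelope calculation. The sketch below assumes the worst case, where every active parameter is read from disk for each token with no caching; the function name is hypothetical.

```python
def seconds_per_token(active_params: float, bits_per_param: float,
                      read_gb_per_s: float) -> float:
    """Time to stream all active weights from disk once, in seconds."""
    bytes_per_token = active_params * bits_per_param / 8
    return bytes_per_token / (read_gb_per_s * 1e9)


# 32 billion active parameters at 4 bits each is 16 GB per token.
print(round(seconds_per_token(32e9, 4, 7), 2))   # 7 GB/s NVMe read speed
print(round(seconds_per_token(32e9, 4, 15), 2))  # a faster drive
```

At 7 GB/s this comes to a bit over two seconds per token, in line with the estimate above; any weights already cached in RAM would lower it.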

Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/

hedgehog · 2026-01-30T22:03:08 1769810588

High end consumer SSDs can do closer to 15 GB/s, though only with PCI-e gen 5. On a motherboard with two m.2 slots that's potentially around 30GB/s from disk. Edit: How fast everything is depends on how much data needs to get loaded from disk which is not always everything on MoE models.

greenavocado · 2026-01-31T00:50:43 1769820643

Would RAID zero help here?

hedgehog · 2026-01-31T01:43:33 1769823813

Yes, RAID 0 or 1 could both work in this case to combine the disks. You would want to check the bus topology for the specific motherboard to make sure the slots aren't on the other side of a hub or something like that.

heliumtera · 2026-01-30T21:22:20 1769808140

You need 600GB of VRAM + MEMORY (+ DISK) to fit the full model, or 240GB for the 1-bit quantized model. Of course this will be slow.

Through moonshot api it is pretty fast (much much much faster than Gemini 3 pro and Claude sonnet, probably faster than Gemini flash), though. To get similar experience they say at least 4xH200.

If you don't mind running it super slow, you still need around 600gb of VRAM + fast RAM.

It's already possible to run 4xH200 in a domestic environment (it would be instantaneous for most tasks, unbelievable speed). It's just very very expensive and probably challenging for most users, manageable/easy for the average hacker news crowd.

High-end GPUs are expensive AND hard to source. If you manage to source them at the old prices, it's around 200 thousand dollars to get maximum speed, I guess. You could probably run decently on a bunch of high-end machines for, let's say, 40k (slow).

segmondy · 2026-01-30T21:38:50 1769809130

You can run it on a mac studio with 512gb ram, that's the easiest way. I run it at home on a multi rig GPU with partial offload to ram.

johndough · 2026-01-30T21:51:08 1769809868

I was wondering whether multiple GPUs make it go appreciably faster when limited by VRAM. Do you have some tokens/sec numbers for text generation?

Carrok · 2026-01-30T19:31:22 1769801482

Not OP but OpenCode and DeepInfra seems like an easy way.

observationist · 2026-01-30T23:46:12 1769816772

API costs on these big models over private hosts tend to be a lot less than API calls to the big 4 American platforms. You definitely get more bang for your buck.

kristianp · 2026-02-01T00:53:35 1769907215

Note that Kimi K2x is natively 4 bit int, which reduces the memory requirements somewhat.

kristianp · 2026-02-04T00:44:44 1770165884

Here's the citation for that, I think it's not in the Technical Report. https://huggingface.co/moonshotai/Kimi-K2.5#4-native-int4-qu...

tgrowazay · 2026-01-30T20:08:21 1769803701

Just pick up any >240GB VRAM GPU off your local BestBuy to run a quantized version.

> The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.

CamperBob2 · 2026-01-30T22:00:03 1769810403

You could run the full, unquantized model at high speed with 8 RTX 6000 Blackwell boards.

I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.

ttul · 2026-01-31T05:31:20 1769837480

That would be quite the space heater, too!

timwheeler · 2026-01-31T05:39:00 1769837940

Did you use Kimi Code or some other harness? I used it with OpenCode and it was bumbling around through some tasks that Claude handles with ease.

zedutchgandalf · 2026-01-31T06:44:12 1769841852

Are you on the latest version? They pushed an update yesterday that greatly improved Kimi K2.5’s performance. It’s also free for a week in OpenCode, sponsored by their inference provider

ekabod · 2026-01-31T09:48:48 1769852928

But it may be a quantized model for the free version.

thesurlydev · 2026-01-30T19:23:00 1769800980

Can you share how you're running it?

eknkc · 2026-01-30T20:05:10 1769803510

I've been using it with opencode. You can either use your kimi code subscription (flat fee), moonshot.ai api key (per token) or openrouter to access it. OpenCode works beautifully with the model.

Edit: as a side note, I only installed opencode to try this model and I gotta say it is pretty good. Did not think it'd be as good as claude code but it's just fine. Been using it with codex too.

Imustaskforhelp · 2026-01-30T20:20:04 1769804404

I tried to use opencode for kimi k2.5 too but recently they changed their pricing from 200 tool requests/5 hour to token based pricing.

I can only speak from the tool request based but for some reason anecdotally opencode took like 10 requests in like 3-4 minutes where Kimi cli took 2-3

So I personally like/stick with the kimi cli for kimi coding. I haven't tested it out again with OpenAI with the new token based pricing but I do think that opencode might add more token issue.

Kimi Cli's pretty good too imo. You should check it out!

https://github.com/MoonshotAI/kimi-cli

nl · 2026-01-30T23:17:40 1769815060

I like Kimi-cli but it does leak memory.

I was using it for multi-hour tasks scripted via an self-written orchestrator on a small VM and ended up switching away from it because it would run slower and slower over time.

zeroxfe · 2026-01-30T20:05:18 1769803518

Running it via https://platform.moonshot.ai -- using OpenCode. They have super cheap monthly plans at kimi.com too, but I'm not using it because I already have codex and claude monthly plans.

esafak · 2026-01-30T20:58:13 1769806693

Where? https://www.kimi.com/code starts at $19/month, which is same as the big boys.

UncleOxidant · 2026-01-30T20:26:22 1769804782

so there's a free plan at moonshot.ai that gives you some number of tokens without paying?

JumpCrisscross · 2026-01-31T00:33:01 1769819581

> Can you share how you're running it?

Not OP, but I've been running it through Kagi [1]. Their AI offering is probably the best-kept secret in the market.

[1] https://help.kagi.com/kagi/ai/assistant.html

deaux · 2026-01-31T04:26:38 1769833598

Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?

lejalv · 2026-01-31T13:26:41 1769866001

> Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?

Yes, it is chat only, but that list is out of date - Kimi 2.5 (with or without reasoning) is available, as are ChatGPT 5.2, Gemini 3 Pro (Preview), etc

explorigin · 2026-01-30T19:34:01 1769801641

https://unsloth.ai/docs/models/kimi-k2.5

Requirements are listed.

KolmogorovComp · 2026-01-30T20:10:05 1769803805

To save everyone a click

> The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs. If the model fits, you will get >40 tokens/s when using a B200. To run the model in near full precision, you can use the 4-bit or 5-bit quants. You can use any higher just to be safe. For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you’re below that, it'll work but speed will drop (llama.cpp can still run via mmap/disk offload) and may fall from ~10 tokens/s to <2 token/s. We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
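The rule of thumb in the quote (RAM+VRAM ≈ the quant size) amounts to simple arithmetic. A minimal sketch, with a hypothetical helper name and an example machine of 256GB RAM plus one 24GB GPU taken from the figures quoted above:

```python
def offload_shortfall_gb(quant_gb: float, ram_gb: float, vram_gb: float) -> float:
    """GB of the quant that does not fit in RAM+VRAM (0.0 if it all fits)."""
    return max(0.0, quant_gb - (ram_gb + vram_gb))


# UD-Q2_K_XL is quoted at 375GB; 256GB RAM + 24GB VRAM leaves a gap
# that would have to come from mmap/disk offload.
print(offload_shortfall_gb(375.0, 256.0, 24.0))

# An illustrative 240GB quant fits entirely on the same machine.
print(offload_shortfall_gb(240.0, 256.0, 24.0))
```

A non-zero shortfall does not mean the model fails to run, only that the remainder is served from disk and generation slows down accordingly.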

Gracana · 2026-01-30T20:21:38 1769804498

I'm running the Q4_K_M quant on a xeon with 7x A4000s and I'm getting about 8 tok/s with small context (16k). I need to do more tuning, I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.

segmondy · 2026-01-30T21:41:03 1769809263

you can add 1 more GPU so you can take advantage of tensor parallel. I get the same speed with 5 3090's with most of the model on 2400mhz ddr4 ram, 8.5tk almost constant. I don't really do agents but chat, and it holds up to 64k.

Gracana · 2026-01-30T21:58:52 1769810332

That is a very good point and I would love to do it, but I built this machine in a desktop case and the motherboard has seven slots. I did a custom water cooling manifold just to make it work with all the cards.

I'm trying to figure out how to add another card on a riser hanging off a slimsas port, or maybe I could turn the bottom slot into two vertical slots.. the case (fractal meshify 2 xl) has room for a vertical mounted card that wouldn't interfere with the others, but I'd need to make a custom riser with two slots on it to make it work. I dunno, it's possible!

I also have an RTX Pro 6000 Blackwell and an RTX 5000 Ada.. I'd be better off pulling all the A4000s and throwing both of those cards in this machine, but then I wouldn't have anything for my desktop. Decisions, decisions!

esafak · 2026-01-30T21:01:31 1769806891

The pitiful state of GPUs. $10K for a sloth with no memory.

indigodaddy · 2026-01-31T01:28:10 1769822890

Been using K2.5 Thinking via Nano-GPT subscription and `nanocode run` and it's working quite nicely. No issues with Tool Calling so far.

gigatexal · 2026-01-30T19:25:34 1769801134

Yeah I too am curious. Because Claude Code is so good, and the ecosystem so "it just works", that I'm willing to pay them.

Imustaskforhelp · 2026-01-30T20:22:49 1769804569

I tried kimi k2.5 and first I didn't really like it. I was critical of it but then I started liking it. Also, the model has kind of replaced how I use chatgpt too & I really love kimi 2.5 the most right now (although gemini models come close too)

To be honest, I do feel like kimi k2.5 is the best open source model. It's not the best model itself right now tho but it's really price performant and for many use cases might be nice depending on it.

It might not be the completely SOTA that people say but it comes pretty close and its open source and I trust the open source part because I feel like other providers can also run it and just about a lot of other things too (also considering that iirc chatgpt recently slashed some old models)

I really appreciate kimi for still open sourcing their complete SOTA and then releasing some research papers on top of them unlike Qwen which has closed source its complete SOTA.

Thank you Kimi!

epolanski · 2026-01-30T20:05:21 1769803521

You can plug another model in place of Anthropic ones in Claude Code.

zeroxfe · 2026-01-30T20:09:33 1769803773

That tends to work quite poorly because Claude Code does not use standard completions APIs. I tried it with Kimi, using litellm[proxy], and it failed in too many places.

xxr3376 · 2026-01-31T16:12:11 1769875931

You can try Kimi's Anthropic-compatible API.

Just connect Claude Code to Kimi's API endpoint and everything works well

https://www.kimi.com/code/docs/en/more/third-party-agents.ht...

AnonymousPlanet · 2026-01-30T20:59:27 1769806767

It worked very well for me using qwen3 coder behind a litellm. Most other models just fail in weird ways though.

samtheprogram · 2026-01-30T21:07:40 1769807260

opencode is a good alternative that doesn't flake out in this way.

miroljub · 2026-01-30T22:15:27 1769811327

If you don't use Anthropic models there's no reason to use Claude Code at all. Opencode gives so much more choice.