
Agreed. Self-hosting gives the cleanest fixed cost, but you pay for it in ops and capacity planning. I’m mainly curious whether there’s a middle ground that gives early teams more predictable spend without immediately taking on full infra overhead.

Serverless GPU providers like Modal or RunPod are probably the closest thing. You pay for execution time rather than tokens so the unit economics are deterministic, and you don't have to manage the underlying capacity or OS. It is still variable billing but you avoid the token markup and the headache of keeping a cluster alive.
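
For a sense of the shape of it, the serverless model is roughly "decorate a function, get billed for the GPU-seconds it runs." A minimal Modal-style sketch (decorator arguments and the model name are illustrative, and a real deployment would keep the engine loaded across calls instead of reloading per request):

  import modal

  app = modal.App("llm-inference")
  image = modal.Image.debian_slim().pip_install("vllm")

  @app.function(gpu="H100", image=image, timeout=600)
  def generate(prompt: str) -> str:
      # Billing is for the seconds this container holds the GPU,
      # not for the tokens it produces.
      from vllm import LLM
      llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
      return llm.generate([prompt])[0].outputs[0].text

You kick it off with "modal run" and the bill is purely wall-clock GPU time, which is why the unit economics feel deterministic.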

A realistic setup for this would be a 16× H100 80GB with NVLink. That comfortably handles the active 32B experts plus KV cache without extreme quantization. Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.
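
Back-of-envelope on what that hardware rate implies for token pricing (throughput and utilization here are assumptions, not benchmarks):

  # Break-even output-token price for a rented 16x H100 node.
  hourly_rate_usd = 50.0       # midpoint of the $40-60/hr range above
  tokens_per_second = 2_000    # assumed aggregate batched decode throughput
  utilization = 0.5            # assumed fraction of the hour actually serving

  tokens_per_hour = tokens_per_second * 3600 * utilization
  cost_per_million = hourly_rate_usd / tokens_per_hour * 1_000_000
  print(f"~${cost_per_million:.2f} per 1M output tokens")  # ~$13.89 here

At higher batch sizes and utilization that number drops fast, which is presumably how providers get to the per-token prices quoted downthread.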

The weights are int4, so you'd only need 8xH100

You don't need to wait and see: Kimi K2 has the same hardware requirements and already has several providers on OpenRouter:

https://openrouter.ai/moonshotai/kimi-k2-thinking
https://openrouter.ai/moonshotai/kimi-k2-0905
https://openrouter.ai/moonshotai/kimi-k2-0905:exacto
https://openrouter.ai/moonshotai/kimi-k2

Generally it seems to be in the neighborhood of $0.50/1M tokens for input and $2.50/1M for output.
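
At those rates a single request is tiny; the hardware numbers above only bite at volume. Quick illustration with made-up token counts:

  input_price = 0.50 / 1_000_000    # $/token, input
  output_price = 2.50 / 1_000_000   # $/token, output

  # Hypothetical request: 8k tokens of context in, 1k tokens generated.
  cost = 8_000 * input_price + 1_000 * output_price
  print(f"${cost:.4f} per request")  # $0.0065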


Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.

You can do it, and it may be OK for a single user with idle waiting times, but performance/throughput will be roughly halved (closer to 2/3) and free context will be more limited with 8xH200 vs 16xH100 (assuming a decent interconnect). Depending a bit on the use case and workload, 16xH100 (or 16xB200) may be the better config for cost optimization. There is often a huge economy of scale with such large mixture-of-experts models, to the point that it can even be cheaper to use 96 GPUs instead of just 8 or 16. The reasons are complicated and involve better prefill caching and less memory transfer per node.

The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use: two Mac Studios connected over Thunderbolt 5 RDMA.

That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
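
Rough math on why the footprint alone is the problem (parameter count and precisions are assumptions for a 1T-scale MoE; KV cache and framework overhead come on top):

  total_params = 1.0e12
  bytes_per_param = {"int4": 0.5, "fp8": 1.0, "fp16": 2.0}

  for precision, b in bytes_per_param.items():
      gb = total_params * b / 1e9
      print(f"{precision}: ~{gb:,.0f} GB of weights resident")
  # int4 ~500 GB, fp8 ~1,000 GB, fp16 ~2,000 GB -- and per-token routing
  # needs all of it reachable at interconnect speeds, not just stored.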

People are running the previous Kimi K2 on 2 Mac Studios at 21 tokens/s, or 4 Macs at 30 tokens/s. It's still premature, but not a completely crazy proposition for the near future, given the rate of progress.

> 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s

Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.


If "fast" routing is per-token, the experts can just reside on SSDs; the performance is good enough these days. You don't need to globally share unified memory across the nodes; you'd just run distributed inference.
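
The bound to check is expert weight traffic per token versus SSD read bandwidth. Worst-case sketch (active parameter count, precision, and drive speed are assumptions; a warm expert cache makes the real number much better):

  active_params_per_token = 32e9   # ~32B active params, per the thread above
  bytes_per_param = 0.5            # int4
  ssd_read_bytes_per_s = 7e9       # assumed NVMe sequential read, ~7 GB/s

  cold_bytes = active_params_per_token * bytes_per_param   # ~16 GB/token
  seconds_per_token = cold_bytes / ssd_read_bytes_per_s
  print(f"~{seconds_per_token:.1f} s/token if every routed expert is cold")
  # ~2.3 s/token fully cold; it only works if the hot experts for your
  # workload mostly stay resident in RAM between tokens.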

Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.


Depends on whether you are using tensor parallelism or pipeline parallelism; in the second case you don't need any sharing.

RDMA over Thunderbolt is a thing now.

I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.

Prompt processing/prefill can most likely even get some speedup from local NPU use: when you're ultimately limited by thermal/power throttling, having more efficient compute available means more headroom.

I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192 token input:

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s

These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.


You do realize that's entirely made up, right?

Could be true, could be fake - the only thing we can be sure of is that it's made up with no basis in reality.

This is not how you use LLMs effectively; that's how you give everyone who uses them a bad name by association.


That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.

TLDR: AI didn’t diagnose anything, it turned years of messy health data into clear trends. That helped the author ask better questions and have a more useful conversation with their doctor, which is the real value here.

TLDR: IPv4 is fully exhausted and no longer growing. Internet growth now depends on IPv6 adoption and address sharing, but IPv6 rollout is still uneven across regions.


I get why this exists and appreciate the transparency, but it still feels like a slippery middle ground. Age prediction avoids hard ID checks, which is good for privacy, yet it also normalizes behavioral inference about users that can be wrong in subtle ways. I'm supportive of the safety goal, but long term I'm more comfortable with systems that rely on explicit user choice and clear guardrails rather than probabilistic profiling, even if that's messier to implement.


What’s interesting here isn’t the humanoid form factor, it’s the systems integration. Plugging robots into Siemens’ industrial stack means they’re being treated like first-class nodes in existing logistics workflows, not special demos. If humanoids can reuse current automation software, safety models, and ops tooling, that lowers adoption friction a lot. The real question is whether reliability and MTBF get good enough to compete with simpler, non-humanoid automation at scale.


TLDR: Soft deletes look easy, but they spread complexity everywhere. Actually deleting data and archiving it separately often keeps databases simpler, faster, and easier to maintain.


That’s fair, and I probably didn’t explain it clearly. We’re building an AI API as a service platform aimed at early developers and small teams who want to integrate AI without constantly thinking about tokens at all.

I agree that token economics are basically a commodity today. The problem we’re trying to address isn’t beating the market on raw token prices, but removing the mental and financial overhead of having to model usage, estimate burn, and worry about runaway costs while experimenting or shipping early features. In that sense it’s absolutely an engineering and finance problem combined, and we’re intentionally tackling it at the pricing and API layer rather than pretending the underlying models are unique.


Would you just be... subsidizing low-volume users? I'm saying this isn't a new problem in the grand scheme of things. Hopefully I'm not being too negative; do you have a site or something to learn more? It's not clear how you can have better token economics yourselves in order to provide me or someone else better token economics, rather than just burning more money lol.


Totally fair question, and you’re not being negative.

We’re not claiming better token economics in the sense of magically cheaper tokens, and we’re not just burning money to subsidize usage indefinitely. You’re right that this isn’t a new problem.

What we’re building is an AI API platform aimed at early developers and small teams who want to integrate AI without constantly reasoning about token math while they’re still experimenting or shipping early features. The value we’re trying to provide is predictability and simplicity, not beating the market on raw token prices. Some amount of cross-subsidy at low volumes is intentional and bounded, because lowering that early friction is the point.

If you want to see what we mean, the site is here: https://oxlo.ai. Happy to answer questions or go deeper on how we're thinking about this.


Oh you're arbing! I see now. Makes sense, seems like it could be useful if you have a rock solid DX.


Thank you!! We are definitely fully focused on developer experience. Would love some feedback if it looks interesting.


I agree with the core concern, but I think the right model is smaller, not zero. One or two strong technical writers using AI as a leverage tool can easily outperform a large writing team or pure AI output. The value is still in judgment, context, and asking the right questions. AI just accelerates the mechanics.


This mostly changes how location is requested, not what you can do with it. Instead of imperative JS calls, location access becomes declarative in HTML, which gives browsers more context for permission UX and auditing. Your app logic, data flow, and fallbacks don’t change, and you’ll still need JS to actually use the location. Think of it as a cleaner permission and intent layer, not a new geolocation capability.

