I'm still trying to understand: what's the biggest group of people that uses local AI (or will)? Students who don't want to pay but somehow have the hardware? Devs who are price-conscious and want free agentic coding?
Local, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones.
It feels like we do it because we can more than because it makes sense, which I am all for! I just wonder if I'm missing some kind of major use case all around me that justifies chaining together a bunch of Mac Studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat, but what edge cases truly need it so badly that it's worth all the effort?
Privacy, both personal and for corporate data protection, is a major reason. Unlimited usage, allowing offline use, supporting open source, not worrying about a good model being taken down/discontinued or changed, and the freedom to use uncensored models or model fine-tunes are other benefits (though this OpenAI model is super-censored - "safe").
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
Add big law to the list as well. There are at least a few firms here that I am just personally aware of running their models locally. In reality, I bet there are way more.
A ton of EMR systems are cloud-hosted these days. There’s already patient data for probably a billion humans in the various hyperscalers.
Totally understand that approaches vary, but beyond EMR there's work to augment radiologists with computer vision for better diagnosis, all sorts of cloudy things.
It’s here. It’s growing. Perhaps in your jurisdiction it’s prohibited? If so I wonder for how long.
In the US, HIPAA requires that health care providers complete a Business Associate Agreement with any other orgs that receive PHI in the course of doing business [1]. It basically says they understand HIPAA privacy protections and will work to fulfill the contracting provider's obligations regarding notification of breaches and deletion. Obviously any EMR service will include this by default.
Most orgs charge a huge premium for this. OpenAI offers it directly [2]. Some EMR providers are offering it as an add-on [3], but last I heard, it's wicked expensive.
I'm pretty sure the LLM services of the big general-purpose cloud providers do (I know for sure that Amazon Bedrock is a HIPAA Eligible Service, meaning it is covered within their standard Business Associate Addendum [their name for the Business Associate Agreement as part of an AWS contract].)
Sorry to edit snipe you; I realized I hadn't checked in a while so I did a search and updated my comment. It appears OpenAI, Google, and Anthropic also offer BAAs for certain LLM services.
In the US, it would be unthinkable for a hospital to send patient data to something like ChatGPT or any other public services.
Might be possible with certain specific regions/environments of Azure though, because IIRC they have a few that support government-confidentiality type of stuff, and some that tout HIPAA compliance as well. Not sure about the details of those, though.
Possibly stupid question, but does this apply to things like M365 too? Because just like with Inference providers, the only thing keeping them from reading/abusing your data is a pinky promise contract.
Basically, isn't your data as safe/unsafe in a sharepoint folder as it is sending it to a paid inference provider?
I do think devs are one of the genuine user groups for local models going forward. No price hikes or random caps dropped in the middle of the night, and in many instances I think local agentic coding is going to be faster than the cloud. It's a great use case.
I am extremely cynical about this entire development, but even I think that I will eventually have to run stuff locally; I've done some of the reading already (and I am quite interested in the text to speech models).
(Worth noting that "run it locally" is already Canva/Affinity's approach for Affinity Photo. Instead of a cloud-based model like Photoshop, their optional AI tools run using a local model you can download. Which I feel is the only responsible solution.)
I agree totally. My only problem is that local models running on my old Mac mini are much slower than, for example, Gemini 2.5 Flash. I have my Emacs set up so I can switch between a local model and one of the much faster commercial models.
Someone else responded to you about working for a financial organization and not using public APIs - another great use case.
These being mixture-of-experts (MoE) models should help. The 20B model only has 3.6B params active at any one time, so minus a bit of overhead the speed should be like running a 3.6B model (while still requiring the RAM of a 20B model).
Here's the ollama version (4.6-bit quant, I think?) run with --verbose:
    total duration:       21.193519667s
    load duration:        94.88375ms
    prompt eval count:    77 token(s)
    prompt eval duration: 1.482405875s
    prompt eval rate:     51.94 tokens/s
    eval count:           308 token(s)
    eval duration:        19.615023208s
    eval rate:            15.70 tokens/s
15 tokens/s is pretty decent for a low-end MacBook Air (M2, 24 GB of RAM). Yes, it's not the ~250 tokens/s of 2.5 Flash, but for my use case anything above 10 tokens/s is good enough.
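For what it's worth, 15.7 tok/s is in the ballpark of what memory bandwidth alone would predict for 3.6B active params. A back-of-envelope sketch (the bandwidth and effective bits-per-param figures below are rough assumptions, not measurements):

    # Back-of-envelope decode-speed ceiling for a MoE model on a bandwidth-bound machine.
    # Assumed figures: ~100 GB/s usable memory bandwidth on a base M2, ~3.6B active params,
    # ~4.5 effective bits/param after quantization overhead. All rough estimates.
    active_params = 3.6e9
    bits_per_param = 4.5
    bandwidth_bytes_s = 100e9

    bytes_per_token = active_params * bits_per_param / 8   # ~2.0 GB of weights read per token
    ceiling_tok_s = bandwidth_bytes_s / bytes_per_token     # ~49 tokens/s theoretical ceiling
    print(f"~{ceiling_tok_s:.0f} tok/s upper bound")

Landing a few times under that ceiling is roughly what you'd expect once attention, KV-cache reads, and framework overhead are counted in.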
It's striking how much of the AI conversation focuses on new use cases, while overlooking one of the most serious non-financial costs: privacy.
I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.
The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.
I'm fairly sure "seemed" is the key word here. LLMs are excellent at making things up - they rarely say "I don't know" and instead generate the most probable guess. People also famously overestimate their own uniqueness. Most likely, you accidentally recreated a kind of Barnum effect for yourself.
> I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.
Maybe I'm missing something, but why wouldn't that be expected? The chat history isn't their only source of information - these models are trained on scraped public data. Unless there's zero information about you and your family on the public internet (in which case - bravo!), I would expect even a "fresh" LLM to have some information even without you giving it any.
Healthcare organizations that can't (easily) send data over the wire while remaining in compliance
Organizations operating in high stakes environments
Organizations with restrictive IT policies
To name just a few -- well, the first two are special cases of the last one
RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model
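To make that sweet spot concrete, here's a minimal sketch of the "classify this free text as A or B" pattern against a local OpenAI-compatible endpoint (Ollama, llama.cpp server, vLLM, etc.); the URL, model name, and category labels are placeholders:

    from openai import OpenAI

    # Any local OpenAI-compatible server works; nothing leaves the machine.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def classify(text: str) -> str:
        """Force the model into a narrow A/B decision instead of open-ended chat."""
        resp = client.chat.completions.create(
            model="qwen3:30b-a3b",  # placeholder; use whatever you have pulled locally
            messages=[
                {"role": "system", "content": "Answer with exactly one word: A or B."},
                {"role": "user",
                 "content": f"Classify this note as A (clinical) or B (administrative):\n{text}"},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

    print(classify("Patient reports mild dizziness after medication change."))

Constraining the output to a single A/B answer is a big part of why small local models hold up in this kind of task.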
This may be true for some large players in coastal states but definitely not true in general
Your typical non-coastal state run health system does not have model access outside of people using their own unsanctioned/personal ChatGPT/Claude accounts. In particular even if you have model access, you won't automatically have API access. Maybe you have a request for an API key in security review or in the queue of some committee that will get to it in 6 months. This is the reality for my local health system. Local models have been a massive boon in the way of enabling this kind of powerful automation at a fraction of the cost without having to endure the usual process needed to send data over the wire to a third party
That access is over a limited API and usually under heavy restrictions on the healthcare org side (e.g., only using a dedicated machine, locked-down software, tracked responses, and so on).
Running a local model is often much easier: if you already have the data on a machine and can run a model without anything crossing the network, you can often do it without any new approvals.
HIPAA systems at any sane company will not have "a straight connect" to anything on Azure, AWS, or GCP. They will likely have a special layer dedicated to record keeping and compliance.
Aren't there HIPAA-compliant clouds? I thought Azure had an offering to that effect, and I imagine that's the type of place they're doing a lot of things now. I've landed roughly where you have, though: text stuff is fine, but don't ask it to interact with files/data you can't copy-paste into the box. If a user doesn't care to go through the trouble of preserving privacy (and I think it's fair to say a lot of people claim to care but their behavior doesn't change), then I just don't see it being a thing people bother with. Maybe something to use offline while on a plane? But even then, I guess United will have Starlink soon, so plane connectivity is gonna get better.
It's less that the clouds are compliant and more that risk management is paranoid. I used to do AWS consulting, and it wouldn't matter if you could show that some AWS service had attestations out the wazoo or that you could even use GovCloud -- some folks just wouldn't update priors.
If you're building any kind of product/service that uses AI/LLMs, the answer is the same as why any company would want to run any other kind of OSS infra/service instead of relying on some closed proprietary vendor API.
Why not turn the question around. All other things being equal, who would prefer to use a rate limited and/or for-pay service if you could obtain at least comparable quality locally for free with no limitations, no privacy concerns, no censorship (beyond that baked into the weights you choose to use), and no net access required?
It's a pretty bad deal. So it must be that all other things aren't equal, and I suppose the big one is hardware. But neural net based systems always have a point of sharply diminishing returns, which we seem to have unambiguously hit with LLMs already, while the price of hardware is constantly decreasing and its quality increasing. So as we go further into the future, the practicality of running locally will only increase.
> I'm still trying to understand: what's the biggest group of people that uses local AI (or will)?
Well, the model makers and device manufacturers of course!
While the Apples, Samsungs, and Googles of the world are unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.
Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.
This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.
As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.
Apple's framework of local-first, escalating to ChatGPT if the query is complex, will be the dominant pattern IMO.
I’m highly interested in local models for privacy reasons. In particular, I want to give an LLM access to my years of personal notes and emails, and answer questions with references to those. As a researcher, there’s lots of unpublished stuff in there that I sometimes either forget or struggle to find again due to searching for the wrong keywords, and a local LLM could help with that.
I pay for ChatGPT and use it frequently, but I wouldn’t trust uploading all that data to them even if they let me. I’ve so far been playing around with Ollama for local use.
~80% of the basic questions I ask of LLMs[0] work just fine locally, and I’m happy to ask twice for the other 20% of queries for the sake of keeping those queries completely private.
[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.
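For the notes-and-emails idea above, the retrieval side doesn't have to be elaborate. A minimal local RAG sketch, assuming an Ollama-style OpenAI-compatible server and a small sentence-transformers embedder (paths and model names are placeholders, and real use would chunk long files):

    import numpy as np
    from pathlib import Path
    from openai import OpenAI
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")       # small local embedding model
    llm = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # local server

    # Build a tiny index: one vector per note (real use would chunk long files).
    paths = sorted(Path("notes").rglob("*.md"))              # placeholder notes directory
    texts = [p.read_text(encoding="utf-8") for p in paths]
    vecs = embedder.encode(texts, normalize_embeddings=True)

    def ask(question: str, k: int = 3) -> str:
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(vecs @ q)[-k:][::-1]                # cosine similarity = dot product here
        context = "\n\n".join(f"[{paths[i]}]\n{texts[i]}" for i in top)
        resp = llm.chat.completions.create(
            model="qwen3:30b",                               # placeholder local model
            messages=[{"role": "user",
                       "content": f"Answer from these notes only, citing file names:\n{context}\n\nQ: {question}"}],
        )
        return resp.choices[0].message.content

    print(ask("What did I note about the failed calibration run?"))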
"Because you can and its cool" would be reason enough: plenty of revolutions have their origin in "because you can" (Wozniak right off the top of my head, Gates and Altair, stuff like that).
But uncensored is a big deal too: censorship is capability-reducing (check out Kilcher's GPT-4chan video and references, the Orca work, and the Dolphin de-tune lift on SWE-Bench-style evals). We pay dearly in capability to get "non-operator-alignment", and you'll notice that competition is hot enough now that at the frontier (Opus, Qwen) the alignment away from what operators want is getting very, very mild.
And then there's the compression. Phi-3 or something runs on a beefy laptop and carries a nontrivial approximation of "the internet" that works on an airplane or a beach with no network connectivity. Talk about vibe coding? I like those look-up-all-the-docs-via-a-thumbdrive-in-Phuket vibes.
And on diffusion stuff, SOTA fits on a laptop or close to it; you can crush OG Midjourney or SD on a MacBook. It's an even smaller gap.
Early GPT-4 ish outcomes are possible on a Macbook Pro or Razer Blade, so either 12-18 month old LLMs are useless, or GGUF is useful.
The AI goalposts thing cuts both ways. If AI is "whatever only Anthropic can do"? That's just as silly as "whatever a computer can't do", and a lot more cynical.
Doing computation that can happen at the endpoints at the endpoints is massively more scalable. Even better, it's done on compute you usually aren't paying for if you're the company providing the service.
I saw an interview with the guy who made Photopea where he talked about how tiny his costs were because all compute was done in the user's browser. Running a SaaS in the cloud is expensive.
It's an underrated aspect of what we used to call "software".
And that's leaving aside questions of latency and data privacy.
Real talk. I'm based in San Juan, and while in general having an office job on a beautiful beach is about as good as this life has to offer, the local version of Comcast (Liberty) is juuusst unreliable enough that I'm buying real gear at both the office and home station after a decade of laptop-and-go, because while it goes down roughly as often as Comcast, it's even harder to get resolved. We had Starlink at the office for like 2 weeks; you need a few real computers lying around.
I'm excited to do just dumb and irresponsible things with a local model, like "iterate through every single email in my 20-year-old gmail account and apply label X if Y applies" and not have a surprise bill.
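That kind of bulk job is exactly where a local model shines, since the millionth call costs the same as the first: nothing. A sketch of the idea, assuming a Google Takeout mbox export and a local OpenAI-compatible server (the model name, export path, and the "Y applies" criterion are placeholders; writing the label back would be a second pass via the Gmail API or IMAP):

    import mailbox
    from openai import OpenAI

    llm = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # local server, no metered bill

    def matches_criterion(subject: str, body: str) -> bool:
        """The 'if Y applies' check, phrased as a yes/no question to the local model."""
        resp = llm.chat.completions.create(
            model="gpt-oss:20b",  # placeholder; any local model you have pulled
            messages=[{"role": "user",
                       "content": f"Is this email a purchase receipt? Answer yes or no.\n\n"
                                  f"Subject: {subject}\n\n{body[:2000]}"}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    to_label = []
    for msg in mailbox.mbox("takeout/all-mail.mbox"):       # placeholder Takeout export path
        payload = msg.get_payload(decode=True) if not msg.is_multipart() else b""
        body = (payload or b"").decode("utf-8", errors="ignore")
        if matches_criterion(msg.get("Subject", ""), body):
            to_label.append(msg.get("Message-ID"))

    print(f"{len(to_label)} messages to label")             # apply the label in a second pass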
People like myself who firmly believe there will come a time, possibly very soon, when all these companies (OpenAI, Anthropic, etc.) will raise their prices substantially. By then no one will be able to do their work to the standard expected of them without AI, and by then maybe they charge $1k per month, maybe $10k. If there is no viable alternative, the sky is the limit.
Why do you think they continue to run at a loss? From the goodness of their heart? Their biggest goal is to discourage anyone from running local models. The hardware is expensive... The way to run models is very difficult (for example, I have dual RTX 3090s for VRAM, and running large, heavily quantized models is a real pain in the arse; no high quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it by the guys behind the best inference tools).
So this is welcome, but let's not forget why it is being done.
> no high quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it by the guys behind the best inference tools
I'm curious to hear what you're trying to run, because I haven't used any software that is not compatible with multiple GPUs.
A laptop from the past few years without a discrete GPU can run, at practical speeds depending on the task, a Gemma/Llama model if it's (IME) under 4GB.
For narrow-scope RAG processes with even a minimal amount of scaffolding, that's a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow steps, other use cases -- all of them extremely valuable in enterprise, being built and deployed right now.
One of my favorite use cases includes simple tasks like generating effective mock/masked data from real data. Then passing the mock data worry-free to the big three (or wherever.)
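The masking step itself can be a single local call, so only the masked text ever leaves the machine. A rough sketch (endpoint and model names are placeholders; real masking would want validation on top of this):

    from openai import OpenAI

    local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # masking never leaves the box
    cloud = OpenAI()                                                        # only ever sees masked text

    def mask(text: str) -> str:
        """Ask a local model to swap names, account numbers, and addresses for realistic fakes."""
        resp = local.chat.completions.create(
            model="qwen3:8b",  # placeholder local model
            messages=[{"role": "user",
                       "content": "Rewrite this text with every person name, account number, email, "
                                  f"and address replaced by realistic fake values:\n\n{text}"}],
            temperature=0,
        )
        return resp.choices[0].message.content

    masked = mask("Jane Roe (acct 4417-2291) reported a failed transfer to jane.roe@example.org.")
    reply = cloud.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Draft a short support reply for this case:\n{masked}"}],
    )
    print(reply.choices[0].message.content)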
There’s also a huge opportunity space for serving clients with very sensitive data. Health, legal, and government come to mind immediately. These local models are only going to get more capable of handling their use cases. They already are, really.
Data that can't leave the premises because it is too sensitive. There is a lot of security theater around cloud pretending to be compliant but if you actually care about security a locked server room is the way to do it.
I can provide a real-world example: Low-latency code completion.
The JetBrains suite includes a few LLM models on the order of a hundred megabytes. These models are able to provide "obvious" line completion, like filling in variable names, as well as some basic predictions, like realising that the `if let` statement I'm typing out is going to look something like `if let Some(response) = client_i_just_created.foobar().await`.
If that was running in The Cloud, it would have latency issues, rate limits, and it wouldn't work offline. Sure, there's a pretty big gap between these local IDE LLMs and what OpenAI is offering here, but if my single-line autocomplete could be a little smarter, I sure wouldn't complain.
> I'm still trying to understand: what's the biggest group of people that uses local AI (or will)?
Creatives? I am surprised no one's mentioned this yet:
I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.
Ultimately, I felt uncomfortable presenting the cloud AIs with their text. Sometimes I feel this way even with my own Substack posts, where I occasionally coin a phrase I am proud of. But with local AI? Cool...
> I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.
But... they're publishing a website. Which competitors will read. Which chatbots will scrape. I genuinely don't get it.
I do it because 1) I am fascinated that I can and 2) at some point the online models will be enshittified — and I can then permanently fall back on my last good local version.
In some large, lucrative industries like aerospace, many of the hosted models are off the table due to regulations such as ITAR. There's a market for models which are run on-prem/in GovCloud with a professional support contract for installation and updates.
I'm in a corporate environment. There's a study group to see if maybe we can potentially get some value out of those AI tools. They've been "studying" the issue for over a year now. They expect to get some cloud service that we can safely use Real Soon Now.
So, it'll take at least two more quarters before I can actually use those non-local tools on company related data. Probably longer, because sense of urgency is not this company's strong suit.
Anyway, as a developer I can run a lot of things locally. Local AI doesn't leak data, so it's safe. It's not as good as the online tools, but for some things it's better than nothing.
If you have capable hardware and kids, a local LLM is great. A simple system prompt customisation (e.g. ‘all responses should be written as if talking to a 10 year old’) and knowing that everything is private goes a long way for me at least.
Local micro models are both fast and cheap. We tuned small models on our data set and if the small model thinks content is a certain way, we escalate to the LLM.
This gives us really good recall at really low cloud cost and latency.
Everything is built in-house, unfortunately. Many of our small models are tuned Qwen3. But we mostly chose whichever model was SOTA at the time we needed a model trained.
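The general shape of that escalation is roughly the following (a sketch only; the model names, the task wording, and the yes/no/unsure protocol here are made up for illustration):

    from openai import OpenAI

    small = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # tuned local micro model
    big = OpenAI()                                                         # big cloud model, used sparingly

    PROMPT = "Does this post mention a product defect? Answer yes, no, or unsure:\n\n"

    def flag(text: str) -> bool:
        # Stage 1: cheap local screen over every item; most traffic stops here.
        first = small.chat.completions.create(
            model="qwen3-tuned",  # placeholder for a fine-tuned small model
            messages=[{"role": "user", "content": PROMPT + text}],
            temperature=0,
        ).choices[0].message.content.strip().lower()
        if first.startswith("no"):
            return False

        # Stage 2: escalate positives and unsure cases to the big model for the final call.
        second = big.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT + text}],
            temperature=0,
        ).choices[0].message.content.strip().lower()
        return second.startswith("yes")

The big model only ever sees the small slice of traffic the local model couldn't confidently dismiss, which is where the recall-at-low-cost tradeoff comes from.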
I would say: any company that hasn't developed their own AI. You always hear about companies "mandating" AI usage, but for the most part it's companies developing their own solutions/agents. No self-respecting company with tight opsec would allow a random "always-online" LLM that could rip your codebase either piece by piece or all at once if it's an IDE addon (or at least I hope that's the case). So yeah, I'd say locally deployed LLMs/agents are a game changer.
Jailbreaking, then running censored questions. Like DIY fireworks, analysis of papers that touch "sensitive topics", NSFW image generation; the list is basically endless.
At the company where I currently work, for IP reasons (and with the advice of a patent lawyer), nobody is allowed to use any online AIs to talk about or help with work, unless it's very generic research that doesn't give away what we're working on.
That rules out coding assistants like Claude, chat, tools to generate presentations and copy-edit documents, and so forth.
But local AI are fine, as long as we're sure nothing is uploaded.
A small LLM can do RAG, call functions, summarize, create structured data from messy text, etc... You know, all the things you'd do if you were making an actual app with an LLM.
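For example, the "structured data from messy text" case is just a constrained prompt plus a JSON parse. A minimal sketch against a local OpenAI-compatible server (model name and schema are placeholders; production code would validate and retry):

    import json
    from openai import OpenAI

    llm = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def extract_contact(text: str) -> dict:
        """Turn free-form text into a fixed schema; a task well within small-model range."""
        resp = llm.chat.completions.create(
            model="gpt-oss:20b",  # placeholder local model
            messages=[{"role": "user",
                       "content": "Return only JSON with keys name, company, phone (null if missing):\n"
                                  + text}],
            temperature=0,
        )
        return json.loads(resp.choices[0].message.content)  # a sketch: no retry/validation here

    print(extract_contact("Spoke with Maria Chen from Acme Corp, she said call back at 555-0182."))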
Yeah, chat apps are pretty cheap and convenient for users who want to search the internet and write text or code. But APIs quickly get expensive when inputting a significant amount of tokens.
There's a bunch of great reasons in this thread, but how about the chip manufacturers that are going to need you to need a more powerful set of processors in your phone, headset, computer. You can count on those companies to subsidize some R&D and software development.
>Students who don’t want to pay but somehow have the hardware?
That's me - well, not a student anymore.
When toying with something, I much prefer not paying for each shot. My 12GB Radeon card can either run a decent model extremely slowly, or an idiotic but fast one. It's nice not dealing with rate limits.
Once you write a prompt that mangles an idiotic model into still doing the work, it's really satisfying. The same principle as working to extract the most from limited embedded hardware. Masochism, possibly.
Some app devs use local models on local environments with LLM APIs to get up and running fast, then when the app deploys it switches to the big online models via environment vars.
In large companies this can save quite a bit of money.
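Since most local servers (Ollama, llama.cpp, vLLM) speak the OpenAI wire format, the dev/prod switch can be a single base-URL environment variable. A sketch of that pattern (the LLM_BASE_URL and LLM_MODEL variable names are arbitrary, not from any particular framework):

    import os
    from openai import OpenAI

    # Dev:  LLM_BASE_URL=http://localhost:11434/v1  LLM_MODEL=gpt-oss:20b
    # Prod: leave LLM_BASE_URL unset and set LLM_MODEL to the hosted model.
    client = OpenAI(
        base_url=os.environ.get("LLM_BASE_URL"),            # None falls back to the default cloud endpoint
        api_key=os.environ.get("OPENAI_API_KEY", "unused"),  # local servers usually ignore the key
    )
    MODEL = os.environ.get("LLM_MODEL", "gpt-4o")

    def complete(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content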
Privacy laws. Processing government paperwork with LLMs, for example. There are a lot of OCR tools that can't be used, and the ones that comply are more expensive than, say, GPT-4.1, and lower quality.
Anything involving the medical industry (HIPAA), national security (FedRAMP is such a PITA to get that some military contractors are bypassing it to get quicker access to cloud tools), etc.
Besides that, we are moving towards an era where we won't need to pay providers a subscription every month to use these models. I can't say for certain whether or not the GPUs that run them will get cheaper, but the option to run your own model is game changing for more than you can possibly imagine.
Agencies / firms that work with classified data. Some places have very strict policies on data, which makes it impossible to use any service that isn't local and air-gapped.
I'd use it on a plane for coding if there were no network, but otherwise it's just an emergency model if the internet goes out; basically end-of-the-world scenarios.
AI is going to be equivalent to all computing in the future. Imagine if only IBM, Apple, and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.
I am all for the privacy angle, and while I think there's certainly a group of us, myself included, who care deeply about it, I don't think most people or enterprises will. I think most of those will go for the easy button and then wring their hands about privacy and security as they have always done, while continuing to let the big companies do pretty much whatever they want. I would be so happy to be wrong, but aren't we already seeing it? Middle-of-the-night price changes, leaks of data, private things that turned out to not be…and yet!
Maybe I am too pessimistic, but as an EU citizen I expect politics (or should I say Trump?) to prevent access to US-based frontier models at some point.