The Cloud Model Has Limits
The default assumption about AI, for most of the past five years, has been that it runs in the cloud: powerful models on powerful servers, accessed via API, with your data traveling to someone else’s infrastructure and back. This model has real advantages — no hardware requirements, always-current models, and no maintenance burden. It also has limits that are becoming increasingly visible as AI moves from novelty to infrastructure.
Privacy is the most cited concern, but it’s not the only one. Latency matters for real-time applications. Internet dependency matters in unreliable or restricted network environments. Cost matters when usage scales. And vendor lock-in matters for organisations building core products on AI infrastructure they don’t control. All of these concerns point toward running models locally — and the hardware and software required to do this have improved dramatically in the past two years.
What Has Changed
Quantisation techniques — methods for reducing model size without proportional quality loss — have made it possible to run genuinely capable language models on consumer hardware. A model that might have required a data centre GPU can now run on a laptop with a modern chip, or on a phone, at acceptable quality for many use cases. Tools like llama.cpp, Ollama, and LM Studio have abstracted away the technical complexity of running local models, making the setup accessible to developers without specialised ML expertise.
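The arithmetic behind this shift is straightforward. A model's weight-storage footprint is roughly its parameter count times the bits used per weight; quantising from 16-bit to 4-bit precision cuts that footprint by a factor of four. The sketch below is a back-of-envelope estimate only — real deployments add overhead for the KV cache, activations, and runtime buffers:

```python
def approx_model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in gigabytes (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real memory
    use will be somewhat higher than these figures.
    """
    return num_params * bits_per_weight / 8 / 1e9

# A 7-billion-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {approx_model_size_gb(7e9, bits):.1f} GB")
# → 16-bit: 14.0 GB
# →  8-bit:  7.0 GB
# →  4-bit:  3.5 GB
```

This is why a model that once needed a data-centre GPU with tens of gigabytes of VRAM can, at 4-bit precision, fit in the unified memory of a consumer laptop.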
The quality of open-weight models — models whose weights are publicly available, meaning they can be downloaded and run locally — has also improved dramatically. Meta’s Llama family, Mistral’s releases, and many fine-tuned variants now offer quality that is competitive with commercial APIs for a wide range of tasks, at zero marginal cost per query.
Privacy and Compliance Use Cases
For organisations handling sensitive data — legal, medical, financial, government — the ability to run AI processing entirely on-premises, with no data leaving the network perimeter, is not a nice-to-have but a compliance requirement. The growth of local AI deployment in these sectors is being driven less by technical enthusiasm than by legal necessity: many regulated industries simply cannot use cloud AI services for sensitive data, regardless of the contractual protections offered.
The same logic applies to individuals who want to use AI tools for personal journaling, therapy-adjacent reflection, or sensitive creative work without the uncertainty of where that data goes and how it might be used. Local models provide a level of privacy guarantee that cloud services cannot match, regardless of their policies.
The Trade-offs Worth Understanding
Local models are not yet better than cloud frontier models for most tasks. GPT-4 and Claude Opus are currently more capable than the best local alternatives for complex reasoning, writing quality, and multi-step problem solving. The trade-off is top-end quality versus privacy, cost, latency, and independence. For many use cases — text summarisation, code completion, document analysis, simple question answering — the quality gap is small enough that local models are the better choice when the other factors matter.
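One way to make this trade-off operational is a simple routing policy: keep requests local unless the quality gap genuinely matters and nothing forces the data to stay on-device. The function and flags below are purely illustrative — an assumption for the sake of example, not the API of any particular product:

```python
from dataclasses import dataclass

@dataclass
class Task:
    sensitive: bool           # data must not leave the network perimeter
    complex_reasoning: bool   # multi-step reasoning or high-stakes writing
    offline: bool = False     # no reliable network connection available

def choose_backend(task: Task) -> str:
    """Illustrative routing heuristic: default to local, escalate to cloud
    only when the task needs frontier-model quality and is free to travel."""
    if task.sensitive or task.offline:
        return "local"        # hard constraints override quality preferences
    if task.complex_reasoning:
        return "cloud"        # the quality gap is decisive here, for now
    return "local"            # summarisation, completion, simple Q&A

print(choose_backend(Task(sensitive=True, complex_reasoning=True)))   # → local
print(choose_backend(Task(sensitive=False, complex_reasoning=True)))  # → cloud
print(choose_backend(Task(sensitive=False, complex_reasoning=False))) # → local
```

The ordering of the checks encodes the argument of this section: privacy and connectivity are constraints, while quality is a preference that only applies when the constraints allow it.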
The trajectory is toward convergence. Local model quality appears to be improving faster than cloud model quality: a large open-source community is working on the optimisation problem, and the consumer-hardware improvement curve hasn’t plateaued. The cases where cloud models have a decisive quality advantage are narrowing over time, and the cases where local deployment is preferable are growing.
Sources
- Meta AI. (2024). Llama Model Family. ai.meta.com/llama.
- Mistral AI. (2024). Open Models Documentation. docs.mistral.ai.
- Gerganov, G. (2024). llama.cpp. github.com/ggerganov/llama.cpp.