April 17, 20269 min

The local LLM revolution already happened. Nobody told the app developers.

A 2019 phone, airplane mode, and a document I wouldn’t paste into a cloud service.


The local LLM revolution already happened. Nobody told the app developers.

A 2019 phone, airplane mode, and a document I wouldn’t paste into a cloud service.

There’s a phone on my desk that nobody uses anymore. LG G8, from 2019. I just pasted a paragraph from a contract into an app on it — a contract I wouldn’t have pasted into a cloud service — and a few seconds later I got back a structured summary: TL;DR, three bullets, five key points, rendered as cards. The phone is in airplane mode. The app is a single HTML file. There is no network connection.

The reasoning is happening on the phone. A 1.5-billion-parameter language model, running locally, picking out what matters in the text and handing it back in the shape I asked for. Every call costs nothing. Every document stays on the device. There is nothing to phone home to.

That’s the whole demo. A phone most people would have thrown away, offline, running agentic AI that fits in your pocket and belongs to you. The interesting part isn’t the model.

Three things crossed a line

Three things changed in the last eighteen months, and almost nobody in mainstream software has noticed.

First, small language models got coherent. A 1.5-billion-parameter model in 2023 produced word salad; the same size in 2025 writes sentences you’d put your name on. Qwen 2.5, Gemma 3, SmolLM 2 — all of them at sizes that fit comfortably in a phone’s RAM, all of them capable of reasoning you’d have paid OpenAI for two years ago.

Second, grammar-constrained JSON output became a standard feature. This is the one nobody talks about, and it’s the one that matters. When a model is forced to emit output conforming to a schema you provide, it stops being a chatbot and starts being a reasoning engine you can call like a function. You hand it {word, hint, difficulty} and you get back exactly that shape, every time. That changes what's possible to build by several orders of magnitude.

Third, Ollama happened. One binary, one command, any model, running locally. The distribution problem — the single hardest thing about getting a local model onto someone’s machine — went from “impossible” to “one line.”

Put them together and the usability line moved. Local AI is finally usable on commodity hardware most people already own. Nobody in the mainstream app world has internalized it yet.

The shape the industry is stuck in

Open any AI app on your phone right now and notice the shape.

It’s a chat window. It talks to a server. It asks you for an account. It tracks usage. It has a subscription tier. Behind the scenes it’s routing your prompt to OpenAI or Anthropic or a hosted open-source model, billing per token, storing your history on someone else’s database. The UI is a chat window because the backend charges per request and chat is the lowest-friction way to rack up requests.

This shape is not neutral. It’s not what AI apps have to look like. It’s what they look like when the incentive structure requires your prompts to travel somewhere, your usage to be metered, your account to be retained. The chat window is what you get when the business model demands cloud routing.

The open-source world hasn’t broken out of it either. Ollama and LM Studio are chat UIs. The “local” apps people build on top of them are chat UIs. The agent frameworks are server-side, pointed at cloud APIs, running on a machine somewhere. When someone ships a “private AI” product, it usually means the cloud endpoint is technically theirs — not that the model is on the device in your pocket.

The industry is building one shape of AI app, over and over, because that’s the shape the economics demand. The shape has very little to do with what the technology is now capable of.

What it looks like when you build for the phone instead

Here’s what changes when you drop the cloud requirement.

An AI app becomes a single HTML file. It has some CSS and JavaScript inlined as <script> and <style> tags — that's it. The LLM lives on the phone, in a local server process. The app talks to it over localhost using structured JSON calls with schemas like {word, hint, difficulty} or {tldr, bullets, key_points}. The "cloud" layer is a process running on the same device, eight characters of URL: localhost:11434.

There is no chat window. There doesn’t have to be. The model is a reasoning engine you program against, not a conversation partner you chat with. A spelling game calls it twice per round — once to pick a word, once to judge the attempt — and renders its own UI for the rest. A document summarizer calls it once with the document pasted in and renders the structured output as cards. The model is function-shaped. You treat it like a function.

There is no account. There is no server bill. There is no “free tier” that will expire. There is no prompt logging on someone else’s machine. No rate limiting, no latency over the WAN, no data sovereignty question, no discontinuation risk, no platform provider who can revoke your access. The app is a file. The model is a binary. Both live on the phone you already own.

This is not a theoretical alternative. It’s running on my 2019 phone right now, offline, at 7.4 tokens per second. It’s running on a 2021 phone at 6.2 tokens per second with a 3-billion-parameter model that’s comfortably usable for structured reasoning. Both phones are unused hardware someone might have thrown away.

The constraint that AI apps must be chat windows backed by cloud APIs is not a law of physics. It’s a business model. You can just ignore it.

Receipts

I built the framework to test the hypothesis. It’s called olladroid.

One line in Termux on an Android phone installs Ollama inside a Debian proot, sets up the PWA launcher, and writes a runnable stack to the device. Pick a model that fits your RAM — qwen2.5:1.5b for 4–6 GB phones, qwen2.5:3b for 8+ GB, smollm2:360m if your phone is tight. Run olladroid new. It walks you through a slug, a template, and a model. A minute later, you have a working AI app under pwa/apps/<slug>/ that you can tap on the phone and use offline. The SDK is about 20 KB, inlined as a single <script> tag. The two reference templates are a spelling game for kids and a paste-in document summarizer, both under 200 lines of app-level code.

olladroid - the AI app framework that fits in one phoneScaffold personalised AI mini-apps that run entirely on an Android phone you already own. No cloud, no account, no data…s1dd4rth.github.io

The benchmark numbers are real, measured, and published per-device with exact prompt sets:

  • LG G8 ThinQ, Snapdragon 855, **qwen2.5:1.5b**: 7.4 tokens per second warm. [full report]
  • OnePlus 9R, Snapdragon 870, **qwen2.5:3b**: 6.2 tokens per second warm. [full report]

The benchmark script is in the repo. Anyone can run it and submit a result for a new device.

And the limits, because I’d rather you hear them from me than find out later. The rule of thumb is you need roughly twice the model download size in free RAM. A 3 GB phone runs nothing useful. The 4 GB floor is real. Gemma 4 in its e2b and e4b sizes (7.2 GB and 9.6 GB) doesn’t fit on phones under 12 GB in practice. Small models are fast but unreliable for structured JSON — gemma3:1b hits 9.6 tokens per second on the LG G8 and will still emit malformed output often enough to matter. Android 13+ needs the GitHub-release Termux APK because the F-Droid one has been broken for months. Two devices tested so far; if you have something else, running the bench and submitting the result is the single most useful contribution right now.

The gap this fills, and why it has to be you

Nobody with a growth target is going to ship this.

An app that runs offline, costs nothing per request, never calls home, and deletes cleanly has no business model anyone will fund. That’s not a conspiracy. It’s a straightforward consequence of how software is built for a living in 2026 — the incentives are on cloud, usage metering, retention, account creation, all the things that make private software impossible by construction.

The only people who are going to build personal AI apps are people who don’t need permission and aren’t chasing a business model. That’s a smaller pool than it sounds. It’s also the whole point. The apps I’d actually want on my phone — ones that know my quirks, ones that summarize the document I can’t send to a third party, ones that remember things without also remembering them for someone else — those only exist if someone builds them for themselves.

The framework is here. The models are here. The only thing missing is you.

What to do with this

Two things, depending on who you are.

If you build software: clone the repo, scaffold an app, run it on a phone you have lying around. The tutorial walks through it end to end. Writing your own template is about 200 lines. You will learn more about what local agentic AI actually is in an afternoon than you will from a year of cloud API tutorials.

If you watch the industry: the thing to track isn’t the next model release. It’s whether the app layer catches up to the model layer. My bet is that it will, eventually, but later than it should, and not from the companies you’d expect. The first hundred useful personal AI apps will mostly be built by hobbyists, for themselves, on hardware they already own.

Originally published on Medium