The LLM Was The Easy Part

Building a local AI app taught me the hard part was everything around the model.

I thought I was building a local chatbot.

Private conversations. Legal research. Sensitive notes. Things I did not want sitting on someone else's server. The plan sounded simple enough: run a model locally, save the chats locally, make it remember me.

From the outside, the model felt like the hard part. That is the part everyone talks about. Which model? How many parameters? What fits on the machine? How fast can it answer?

Then I installed Ollama, pulled a model, sent it a message, and got a response.

That part was basically done.

The LLM was not the product. It was a dependency with a chat-shaped API. The product was everything that had to happen before and after the model answered.

If an AI app only has to demo, you can fake a lot. If you plan to use it tomorrow, normal software engineering shows up immediately. Saving conversations. Loading them without corrupting them. Remembering facts without inventing them. Uploading PDFs. Rendering markdown. Deleting data for real. Letting someone install it on a machine that is not yours.

The model call was boring

Ollama made the first model call boring in the best way. Send messages to a local HTTP endpoint. Stream tokens back. Put those tokens in a chat bubble.

There were details, but they were contained. Gemma had thinking mode enabled by default, so it spent tokens writing out a reasoning block before answering simple prompts. Adding think: false made normal chat much faster. The 31B model was too slow for my taste, so I switched to a 26B mixture-of-experts model and used a smaller model for background tasks like title generation and fact extraction.
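Here is a minimal sketch of that loop, not RecallMEM's actual code. It assumes Ollama's default local address (http://localhost:11434), its /api/chat endpoint streaming one JSON object per line, and the think request field. The function names are made up for illustration.

```typescript
// Shape of one streamed line from /api/chat (only the fields used here).
type ChatChunk = { message?: { content?: string }; done?: boolean };
type ChatMessage = { role: string; content: string };

// Build the request body. `think: false` asks the model to skip its reasoning block.
function buildChatBody(model: string, messages: ChatMessage[]) {
  return { model, messages, stream: true, think: false };
}

// Incremental parser for newline-delimited JSON: feed raw text, get back new tokens.
// Network chunks can end mid-line, so the unfinished tail is buffered until next time.
function createTokenParser() {
  let buffer = "";
  return function feed(chunk: string): string {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // the last piece may be an incomplete line
    let text = "";
    for (const line of lines) {
      if (line.trim() === "") continue;
      const parsed = JSON.parse(line) as ChatChunk;
      text += parsed.message?.content ?? "";
    }
    return text;
  };
}

// Stream a reply from a local Ollama server, calling onToken for each piece of text.
async function streamChat(
  model: string,
  messages: ChatMessage[],
  onToken: (text: string) => void,
  signal?: AbortSignal, // wire this to a stop button
): Promise<void> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatBody(model, messages)),
    signal,
  });
  if (!res.ok || !res.body) throw new Error(`Ollama returned ${res.status}`);

  const feed = createTokenParser();
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const text = feed(decoder.decode(value, { stream: true }));
    if (text) onToken(text);
  }
  const rest = feed("\n"); // flush a final line that arrived without a newline
  if (rest) onToken(rest);
}
```

The parser is split out from the fetch call on purpose: it is the only part with real logic, and it can be tested without a model running.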

Those were real tuning problems. They were not the hard part.

The hard part was deciding what a conversation even is. A chat needs messages, titles, timestamps, model metadata, provider selection, attached files, partial saves while streaming, a stop button, a sidebar, pinned chats, renamed chats, and recovery when the browser closes at the worst possible time.

The model answered. The app had to behave.

Deletion was harder than building

My first instinct was to reuse Speak2Me, the production voice AI journal app I had already built. It had memory, prompts, facts, transcripts, embeddings, and a real product shape. I figured I could strip out the cloud parts and keep the good stuff.

That was wrong.

I started deleting Hume, Stripe, auth, Inngest, voice components, and production-only routes. Every file I deleted broke five files that imported it. The graph page depended on the journal page, which depended on the dashboard, which depended on the background pipeline. Within an hour the codebase looked like a half-disassembled engine.

That is the thing production code does. It optimizes for the product it became, not the smaller tool you wish it could turn into later.

I stopped fighting it and started fresh with a new Next.js app. The only thing I kept was a reference folder with the parts that were actually valuable: prompts, memory extraction logic, types, dates, and fact handling. Not imported. Just reference material.

That saved the project. The blank slate had less friction than a half-broken product.

The database fight was the warning

The next fight had nothing to do with language models.

I needed storage. SQLite was tempting because one file and zero setup is hard to beat. TiDB matched Speak2Me, but self-hosting a distributed database for a personal chatbot made no sense. YugabyteDB was interesting because it speaks Postgres and supports pgvector, but the Homebrew tap path was dead and the install options were more work than the app deserved.

The boring answer won: vanilla Postgres with pgvector.

That choice kept the future path open. Local Postgres today. Maybe managed Postgres later. Maybe distributed Postgres later. Same driver. Same SQL shape. Same vector operators.

Then Homebrew made it annoying. Postgres 16 did not have the pgvector files I needed. Postgres 17 had stale share directories and mismatched libraries left behind from prior installs. I ended up wiping the broken Postgres pieces and reinstalling cleanly before CREATE EXTENSION vector; finally worked.

It was not glamorous. It was also the foundation. If the database is wrong, memory is fake no matter how good the model is.

Local models still have knobs

Local AI sounds like one decision: run the model on your machine.

In practice, it is a pile of smaller decisions. Which binary is actually running? Which server is the CLI talking to? Is the desktop Ollama app running one version while Homebrew installed another? Is thinking mode on? Is the model dense or mixture-of-experts? How much prompt are you sending every turn?

I had Ollama installed twice, and the server and client versions did not match. The symptom was not a clear error. A model pull just failed, with a link to the download page buried in the output. Once the versions matched, the same command worked.

That is local software. You get privacy and control. You also inherit everything about the machine.

None of this changed the product idea. It changed the product reality. A local AI app cannot just call a model and call itself done. It has to detect the environment, explain what is missing, recover from bad defaults, and avoid making the user debug your assumptions.
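As a rough sketch of what "explain what is missing" can look like for the Ollama case, the check below turns two version strings into a message a person can act on. It assumes the server version comes from Ollama's /api/version endpoint and that the client version has already been read from the CLI by some other means. The function names and message wording are hypothetical.

```typescript
type Diagnosis = { ok: boolean; message: string };

// clientVersion: what the CLI on PATH reports, or null if no CLI was found.
// serverVersion: what the running server reports, or null if nothing answered.
// How the caller collects these is an assumption of this sketch.
function diagnoseOllama(clientVersion: string | null, serverVersion: string | null): Diagnosis {
  if (clientVersion === null) {
    return { ok: false, message: "Ollama CLI not found. Install Ollama, then re-run setup." };
  }
  if (serverVersion === null) {
    return { ok: false, message: "Ollama is installed but the server is not responding. Start it, then re-run setup." };
  }
  if (clientVersion.trim() !== serverVersion.trim()) {
    return {
      ok: false,
      message:
        `Ollama version mismatch: CLI is ${clientVersion.trim()}, server is ${serverVersion.trim()}. ` +
        "Two installs may be present. Remove one or update both.",
    };
  }
  return { ok: true, message: `Ollama ${serverVersion.trim()} is running.` };
}

// Ask the local server for its version. null means "not reachable".
async function fetchServerVersion(baseUrl = "http://localhost:11434"): Promise<string | null> {
  try {
    const res = await fetch(`${baseUrl}/api/version`);
    if (!res.ok) return null;
    const data = (await res.json()) as { version?: string };
    return data.version ?? null;
  } catch {
    return null; // connection refused, timeout, and so on
  }
}
```

The point is less the comparison than the output: every failing branch names the problem and the next step, instead of leaving a download link buried in a log.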

Memory made it a system

The first memory bug looked harmless until I checked the database.

I saved conversations as text transcripts. Each message started with user: or assistant:, and messages were separated by blank lines. That looked readable and simple.

Then markdown happened.

Assistant replies have blank lines inside them. Headings, lists, paragraphs, code explanations. My loader split the transcript on blank lines, kept blocks that started with assistant:, and dropped continuation blocks without a prefix. The full response was in Postgres, but loading the chat silently threw away most of it.

Worse, if the user kept chatting after a reload, the truncated conversation could get saved again. A parser bug became memory loss.

The fix was small: continuation blocks attach to the previous message. The lesson was larger: serialization is a contract. If your save format and load format do not agree, your memory system will lie quietly.
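A sketch of the fixed round trip, under stated assumptions: the prefix is followed by a single space, messages are joined with one blank line, and the type and function names are mine rather than the app's.

```typescript
type Role = "user" | "assistant";
type Message = { role: Role; content: string };

// Save format: "role: content" blocks separated by one blank line.
function serializeTranscript(messages: Message[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n\n");
}

// Load format: split on blank lines, but a block WITHOUT a role prefix is a
// continuation of the previous message (a markdown paragraph, list, heading).
function parseTranscript(text: string): Message[] {
  const messages: Message[] = [];
  for (const block of text.split("\n\n")) {
    const match = /^(user|assistant): ?([\s\S]*)$/.exec(block);
    if (match) {
      messages.push({ role: match[1] as Role, content: match[2] });
    } else if (messages.length > 0) {
      // Re-attach the block with the blank line that the split removed.
      messages[messages.length - 1].content += "\n\n" + block;
    }
    // A prefix-less block before any message has no owner and is skipped.
  }
  return messages;
}

const original: Message[] = [
  { role: "user", content: "Summarize the ruling." },
  { role: "assistant", content: "# Summary\n\nThe court held two things.\n\n- First point\n- Second point" },
];
const reloaded = parseTranscript(serializeTranscript(original));
console.log(JSON.stringify(reloaded) === JSON.stringify(original)); // true
```

This format is still ambiguous in one case: a reply paragraph that itself begins with user: or assistant: would be read as a new message. A structured format, such as one row or one JSON object per message, removes that class of bug instead of patching it.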

Memory also changed the rules for when data had to be flushed. Background extraction felt elegant until I clicked New Chat two seconds after a response and the next chat loaded before the facts were saved. Async is fine until the next operation depends on the result. At that point you need a boundary. For RecallMEM, the app this project became, changing chats was that boundary.
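One way to express that boundary, as a sketch rather than the app's real implementation: track every background job, and make the chat switch wait until all of them have settled. The class and method names are hypothetical.

```typescript
// Tracks in-flight background work so a later step can wait for all of it.
class PendingWork {
  private pending = new Set<Promise<unknown>>();

  // Register a job. It removes itself once it settles, whether it succeeded or failed.
  track<T>(work: Promise<T>): Promise<T> {
    this.pending.add(work);
    const forget = () => {
      this.pending.delete(work);
    };
    work.then(forget, forget);
    return work;
  }

  // Resolve only when nothing is in flight, including jobs tracked while waiting.
  // Failures do not reject here. The caller that started the job handles those.
  async flush(): Promise<void> {
    while (this.pending.size > 0) {
      await Promise.all(Array.from(this.pending, (p) => p.catch(() => undefined)));
    }
  }
}
```

In use, each finished reply would call something like extraction.track(saveFacts(chatId)), and the New Chat handler would await extraction.flush() before loading anything. saveFacts is a stand-in for whatever does the extraction. A job that never settles would block the switch forever, so a real version probably wants a timeout.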

The interface was not polish

I used to think of chat UI details as polish.

They are not.

A stop button is not polish when a local model can get stuck thinking for a minute. Sticky scroll is not polish when streaming tokens keep yanking the page down while you are reading an earlier answer. Copy buttons are not polish when the app is used for legal research, code, or long technical explanations. Draft recovery is not polish when a browser refresh can throw away the message you were writing.
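Sticky scroll is a good example of how small the logic is compared to how much it matters. A minimal sketch, with a hypothetical function name and an arbitrary 48-pixel threshold: follow the stream only while the reader is already near the bottom.

```typescript
// True when the viewport is within `threshold` pixels of the end of the content.
// Arguments mirror the DOM properties of the scrolling element.
function isNearBottom(
  scrollTop: number,
  clientHeight: number,
  scrollHeight: number,
  threshold = 48,
): boolean {
  return scrollHeight - (scrollTop + clientHeight) <= threshold;
}
```

Record the result in the scroll handler, not when a token arrives. By the time new text has been appended the content is already taller, so the check would report that the reader has scrolled away even if they have not moved. When a token lands, jump to the bottom only if the last recorded value was true.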

PDF upload was the same kind of lesson. Text extraction worked until the library changed its API. Then the worker file was missing from the Next.js bundle. Then scanned pages and diagrams needed vision, not just text. Suddenly file upload was not an attachment feature. It was an ingestion pipeline.

The model did not care about any of this. The user did.

Installability is product

RecallMEM worked on my machine for the least interesting reason: my machine had months of development state on it.

Postgres was installed. pgvector worked. Ollama was running. Models were already pulled. Environment variables existed. Weird one-time setup problems had been solved and forgotten.

Then I tested the npm package on a clean laptop and hit three showstoppers in 30 minutes. Ollama was installed but not running. The model picker was skipped because of a setup flag. Background memory extraction used a hardcoded fast model that the laptop did not have.

The app was not installable. It was lucky.
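The hardcoded model is the easiest of the three to sketch a fix for. The version below assumes the installed model names come from Ollama's /api/tags listing and that a name without a tag means :latest. The function names are hypothetical and this is not the fix the app actually shipped.

```typescript
// Ollama treats "name" and "name:latest" as the same model, so normalize first.
// Only the part after the last "/" is checked, so a registry host with a port is left alone.
function withDefaultTag(name: string): string {
  const lastSegment = name.slice(name.lastIndexOf("/") + 1);
  return lastSegment.includes(":") ? name : `${name}:latest`;
}

// Prefer the fast model for background tasks, fall back to the chat model,
// and return null when neither is installed so setup can offer to pull one.
function pickBackgroundModel(
  preferred: string,
  chatModel: string,
  installed: string[],
): string | null {
  const have = new Set(installed.map(withDefaultTag));
  if (have.has(withDefaultTag(preferred))) return preferred;
  if (have.has(withDefaultTag(chatModel))) return chatModel;
  return null;
}
```

Falling back to the chat model makes title generation and fact extraction slower, but slower is better than silently never running.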

That is why the CLI became part of the product. npx recallmem had to detect Postgres, pgvector, Ollama, the database, services, models, migrations, and environment files. The npm package had to stay tiny, so the CLI ships as a small bootstrapper and clones the real app only when needed.

The published package was about 22KB. That mattered more than I expected. The first experience of a local AI app is not the chat screen. It is whether setup respects the user's time.

What I learned

The LLM was the easy part because APIs are easy compared to durable behavior.

The hard part was not getting a model to respond. The hard part was making the app remember correctly, store safely, delete honestly, recover from reloads, explain setup failures, avoid surprise token bills, and work on a machine I had never touched.

That is what AI apps become once they stop being demos. They are not prompts with a UI. They are ordinary software systems with one very strange dependency in the middle.

That dependency can write, reason, summarize, and surprise you. It can also be slow, expensive, wrong, or missing from the user's laptop entirely.

The job is everything around it.

The LLM was the easy part. The product was the rest of the system.


Chris Dabatos - DevRel Engineer

DevRel Engineer, Builder, and Technical Storyteller based in Las Vegas. He builds things with AI and writes about what breaks.
