Don’t Build A Second Product For Voice
Most voice demos are innocent because they have nothing to betray. They don’t have users yet. They don’t have saved conversations, permissions, model settings, memory, billing history, support tickets, or some weird old transcript that suddenly matters because the user said “wait, what was that thing from yesterday?”
A voice demo can be one request wearing a microphone. A voice product cannot.
That was the lesson I hit while adding Deepgram Voice Agent to RecallMEM, my local-first AI memory app. RecallMEM already had a text chat. It already had Postgres, pgvector, saved conversations, provider settings, memory extraction, and the usual pile of decisions that only exists because a demo accidentally became something I actually used.
So the job was not “add speech-to-text.” That would have been easy. Record audio, transcribe it, send text to an LLM, play speech back. Useful. Also not the point.
The real job was making voice belong to the app that already existed. If the text app remembers things, voice has to remember them too. If the text app saves transcripts, voice has to save transcripts. If the text app has settings, tools, auth boundaries, and user context, voice cannot float off into a cute separate session and pretend that counts.
That is how you end up with two products: the real app, and the shiny voice thing duct-taped next to it.
The Wrong Shape
The wrong architecture looks reasonable:
Browser mic
-> audio blob
-> app server
-> speech-to-text
-> LLM
-> text-to-speech
-> browser speaker
There’s nothing evil about this flow. For a weekend bot, it’s fine. For a product with existing state, it starts lying almost immediately.
The browser has the microphone, so it feels like the browser should own the experience. The model is the “agent,” so it feels like the model should own the context. The voice session is new, so it feels like it should get its own transcript. All of those instincts are convenient. All of them make the product worse.
The better shape looks more like this:
Browser mic
  -> Deepgram Voice Agent
       -> listens
       -> handles turns
       -> speaks back
       -> asks for tools when needed
            -> app-owned tool routes
                 -> memory
                 -> transcripts
                 -> settings
                 -> auth boundaries
                 -> normal app persistence
That boundary is the whole thing. Deepgram should be the live voice runtime: listening, turn-taking, tool calls, speech back to the user. Your app should still decide what the agent can see, what it can touch, what it should remember, and where the transcript goes.
The model talks. The product owns the truth.
The Browser Gets A Token, Not The Keys To The House
Voice feels client-side because the mic is client-side. That does not mean the browser gets to own the important parts.
In RecallMEM, the browser asks the app server for the Voice Agent config. The server keeps the long-lived Deepgram API key. It creates a short-lived browser token, builds the settings payload, and sends back only what the browser needs to start the session.
That dividing line matters more than it sounds. Voice agents are not static widgets. They depend on user settings, available providers, memory rules, selected models, tool access, and whatever product context the app has already earned. If those rules leak into browser code, every product change becomes another little client-side trap.
The browser should capture audio and play audio. The server should know what the product is.
That is not security theater. It is architecture hygiene.
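A rough sketch of that split, assuming the server has already exchanged its long-lived key for a short-lived token. The helper name `buildVoiceSessionConfig`, the types, and the payload shape are hypothetical illustrations, not RecallMEM's actual code.

```typescript
// Hypothetical types: names and fields are assumptions for illustration.
interface ServerSecrets {
  deepgramApiKey: string; // long-lived key, must never leave the server
}

interface ShortLivedToken {
  accessToken: string;
  expiresInSeconds: number;
}

interface BrowserVoiceConfig {
  token: string;
  expiresAt: number; // epoch milliseconds
  settings: Record<string, unknown>;
}

// Assemble the only payload the browser receives: a temporary token
// plus the Settings message the server built for this user.
function buildVoiceSessionConfig(
  secrets: ServerSecrets,
  token: ShortLivedToken,
  settings: Record<string, unknown>,
  now: number = Date.now(),
): BrowserVoiceConfig {
  const payload: BrowserVoiceConfig = {
    token: token.accessToken,
    expiresAt: now + token.expiresInSeconds * 1000,
    settings,
  };
  // Defensive check: refuse to ship a payload that contains the long-lived key.
  const key = secrets.deepgramApiKey;
  if (key.length > 0 && JSON.stringify(payload).includes(key)) {
    throw new Error("Long-lived API key leaked into browser payload");
  }
  return payload;
}
```

The leak check is deliberately blunt. It costs one string scan per session start and catches the mistake where someone later adds the key to the settings object "just for debugging."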
Settings Are Not Boilerplate
Deepgram Voice Agent is not “STT plus TTS.” The Settings message is where the live system gets assembled.
In RecallMEM, the settings define the audio format, listening model, thinking model, speaking voice, tools, and greeting.
const settings = {
  type: "Settings",
  audio: {
    input: { encoding: "linear16", sample_rate: 16000 },
    output: { encoding: "linear16", sample_rate: 48000, container: "none" },
  },
  agent: {
    listen: {
      provider: {
        type: "deepgram",
        version: "v2",
        model: "flux-general-en",
        keyterms,
      },
    },
    think: thinkChain,
    speak: speakChain,
    greeting: "Hey, I'm here. What's up?",
  },
};
The annoying parts are where the useful parts live. Flux was not just a model string swap. The listen provider also needed version: "v2", and the payload should not carry Nova-style options like smart_format. That is exactly the kind of tiny detail that feels beneath a blog post until it saves another developer an hour.
RecallMEM also disables slow local models for live voice. Local Gemma/Ollama can still work for text chat. They do not belong in the live voice path unless the user enjoys dead air.
That is not really an API decision. It is a product decision. Text can tolerate waiting. Voice can’t. In text, a slow answer feels like the app is thinking. In voice, silence feels like failure.
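One way to encode that product decision is a per-surface filter over the providers the user has configured. This is only a sketch: the `ThinkProvider` fields, the latency budget, and the provider ids are all assumptions, not RecallMEM's real settings model.

```typescript
// Hypothetical provider descriptor; the fields are assumptions for illustration.
interface ThinkProvider {
  id: string;
  typicalFirstTokenMs: number; // rough measured latency to first token
}

// Assumed budget for live voice. Tune it against your own measurements.
const VOICE_MAX_FIRST_TOKEN_MS = 800;

// Text chat may use every configured provider.
// Live voice only gets the ones that answer within the budget.
function providersForSurface(
  providers: ThinkProvider[],
  surface: "text" | "voice",
): ThinkProvider[] {
  if (surface === "text") return providers;
  return providers.filter(
    (p) => p.typicalFirstTokenMs <= VOICE_MAX_FIRST_TOKEN_MS,
  );
}
```

Keeping the rule in one function means the voice picker and the server-side Settings builder cannot disagree about which models are allowed.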
Don’t Start Streaming Just Because The Socket Opened
Realtime bugs have a special talent for making you feel stupid. The WebSocket can be open. The mic can be ready. The agent can still not be ready for your audio yet.
RecallMEM handles Welcome, sends Settings when the socket opens, and only streams microphone audio after SettingsApplied. That last gate matters more than the event order.
Browser mic
-> asks app server for a short-lived Deepgram token
-> opens Deepgram Voice Agent WebSocket
-> sends Settings on socket open
-> handles Welcome
-> waits for SettingsApplied
-> streams microphone audio
RecallMEM manages the WebSocket directly and uses Deepgram’s browser helpers for microphone capture and PCM playback. The important guard is boring:
const microphone = new AgentMicrophone((data) => {
  const ws = wsRef.current;
  if (!settingsAppliedRef.current || !ws || ws.readyState !== WebSocket.OPEN) {
    return;
  }
  ws.send(data);
}, {
  sampleRate: VOICE_INPUT_SAMPLE_RATE,
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
});
That settingsAppliedRef check is not glamorous. It is the difference between “this usually works” and “why did my first turn vanish?”
Most voice demos hide lifecycle because lifecycle is ugly. The product cannot hide it. You need welcome, settings, settings applied, mic start, keepalive, interruption, playback, cleanup, reconnects, and enough state to explain what is happening when something goes wrong.
The lifecycle code is what makes the thing feel alive instead of haunted.
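The startup part of that lifecycle can be written down as a tiny state machine. This is a sketch under assumptions: the state names and the `nextVoiceState` reducer are hypothetical, and keepalive, interruption, and reconnects are left out to keep it short.

```typescript
type VoiceState =
  | "idle"
  | "connecting"
  | "awaiting_settings_applied"
  | "live"
  | "closed";

type VoiceEvent =
  | { type: "connect" }
  | { type: "socket_open" }
  | { type: "Welcome" }
  | { type: "SettingsApplied" }
  | { type: "socket_close" };

// Pure transition function: easy to log, easy to test, no socket required.
function nextVoiceState(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (event.type) {
    case "connect":
      return state === "idle" || state === "closed" ? "connecting" : state;
    case "socket_open":
      // Settings are sent at this point. Audio stays gated.
      return state === "connecting" ? "awaiting_settings_applied" : state;
    case "Welcome":
      // Informational only: it does not unlock the microphone.
      return state;
    case "SettingsApplied":
      return state === "awaiting_settings_applied" ? "live" : state;
    case "socket_close":
      return "closed";
  }
}

// The single question the mic callback needs answered.
function canStreamMic(state: VoiceState): boolean {
  return state === "live";
}
```

With the state in one place, "why did my first turn vanish?" becomes a log line showing the state was still awaiting_settings_applied when audio arrived.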
Memory Is A Tool, Not A Prompt Dump
This is where the architecture either gets useful or turns into soup.
RecallMEM already has memory. Text chat uses it. Voice has to use it too. The lazy move is to stuff the prompt with recent messages and hope the voice agent has enough context. That works until the conversation is long, the context is stale, or the user asks about something exact from three days ago.
The better move is to make memory a narrow tool.
When Deepgram sends a FunctionCallRequest, the browser calls RecallMEM’s memory endpoint. The server searches memory and formats the result, and the browser relays that result to Deepgram as a FunctionCallResponse.
sendJsonMessage({
  type: "FunctionCallResponse",
  id: fn.id,
  name: fn.name,
  content,
});
The browser never touches Postgres. Server routes do. The memory tool route is the only part of the live tool-call flow that queries memory.
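A sketch of the browser-side glue around that message, under assumptions. The `RequestedFunction` shape is inferred from the `fn.id` and `fn.name` fields used above plus an assumed JSON-encoded `arguments` string, and both helper names are hypothetical. The point is that the agent always gets an answer, even when the lookup fails.

```typescript
// Assumed shape of one requested function; check it against the real payload.
interface RequestedFunction {
  id: string;
  name: string;
  arguments: string; // assumed to be a JSON-encoded object
}

interface FunctionCallResponse {
  type: "FunctionCallResponse";
  id: string;
  name: string;
  content: string;
}

type ToolOutcome = { ok: true; text: string } | { ok: false; reason: string };

// Parse tool arguments without letting malformed JSON break the session.
function parseToolArguments(raw: string): Record<string, unknown> {
  try {
    const parsed: unknown = JSON.parse(raw);
    const isPlainObject =
      typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
    return isPlainObject ? (parsed as Record<string, unknown>) : {};
  } catch (err) {
    return {};
  }
}

// Always produce a response so the agent is never left waiting on a tool.
function toFunctionCallResponse(
  fn: RequestedFunction,
  outcome: ToolOutcome,
): FunctionCallResponse {
  const content = outcome.ok
    ? outcome.text
    : `Memory lookup unavailable: ${outcome.reason}`;
  return { type: "FunctionCallResponse", id: fn.id, name: fn.name, content };
}
```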
That route combines exact keyword search and semantic search. It also has a timeout on purpose.
Text chat can wait. Voice can’t. If retrieval takes too long, the user does not think, “Ah, pgvector is doing its best.” They think the agent stopped working.
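A minimal sketch of a route body with both properties: merged keyword and semantic hits, and a hard deadline. Everything here is hypothetical (the `MemoryHit` shape, the injected search functions, the 1500 ms default), and it assumes both searches return scores normalized to the same 0 to 1 range.

```typescript
interface MemoryHit {
  id: string;
  text: string;
  score: number; // assumed normalized so both searches are comparable
}

// Merge both result sets, keep the best score per id, highest first.
function mergeHits(
  keyword: MemoryHit[],
  semantic: MemoryHit[],
  limit: number,
): MemoryHit[] {
  const best = new Map<string, MemoryHit>();
  for (const hit of keyword.concat(semantic)) {
    const seen = best.get(hit.id);
    if (!seen || hit.score > seen.score) best.set(hit.id, hit);
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Resolve with `fallback` if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Hypothetical search functions, injected so the route stays testable.
interface MemorySearchDeps {
  keywordSearch(query: string): Promise<MemoryHit[]>;
  semanticSearch(query: string): Promise<MemoryHit[]>;
}

function searchMemoryForVoice(
  query: string,
  deps: MemorySearchDeps,
  timeoutMs: number = 1500,
  limit: number = 5,
): Promise<MemoryHit[]> {
  const both = Promise.all([
    deps.keywordSearch(query),
    deps.semanticSearch(query),
  ]).then(([k, s]) => mergeHits(k, s, limit));
  // An empty result beats dead air: the agent can say it found nothing.
  return withTimeout(both, timeoutMs, [] as MemoryHit[]);
}
```

Returning an empty list on timeout is a design choice, not the only option. Returning whichever search finished first would be a reasonable variation.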
One of my worst latency bugs came from starting a voice session inside a long chat and sending too much recent transcript context into Deepgram. It was technically “more context.” It was also worse. The fix was to keep startup context small: a few compact recent messages, shorter profile/rules text, fewer memory facts, then let search_memory pull older detail only when needed.
Small hot context. Fast tool retrieval. No giant transcript dump.
That is the pattern.
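As a sketch, the "small hot context" half of that pattern can be a single trimming function. The name `buildStartupContext` and the default limits are assumptions picked for illustration, not the values RecallMEM uses.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Keep only the last few messages and cap each one,
// so session startup never ships a giant transcript.
function buildStartupContext(
  history: ChatMessage[],
  maxMessages: number = 6,
  maxCharsPerMessage: number = 280,
): ChatMessage[] {
  if (maxMessages <= 0) return [];
  return history.slice(-maxMessages).map((m) => ({
    role: m.role,
    content:
      m.content.length > maxCharsPerMessage
        ? m.content.slice(0, maxCharsPerMessage) + "…"
        : m.content,
  }));
}
```

Anything older or longer than this stays reachable through search_memory, which is the other half of the pattern.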
Voice Has To Write Back
If voice turns do not save back into the normal chat, the app has already split in two.
In RecallMEM, when Deepgram emits ConversationText, the app appends it into the same message list normal chat uses. Assistant turns go through the normal save path. Memory extraction runs after that, the same way it does for text.
That means a voice conversation can become future memory. No separate voice database. No special transcript format. No “the app remembers what I typed but forgets what I said.”
That sounds obvious. It is exactly the kind of obvious thing demos skip.
Voice is allowed to feel new to the user. It should not be new to the architecture.
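In code, "not new to the architecture" can be as small as this sketch, which maps a ConversationText event onto the same message shape text chat uses. The `appendVoiceTurn` helper and the `ChatMessage` type are hypothetical, and persistence and memory extraction are assumed to happen in the existing save path, not here.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface ConversationTextEvent {
  type: "ConversationText";
  role: "user" | "assistant";
  content: string;
}

// Voice turns become ordinary chat messages: same shape, same list.
// Empty or whitespace-only turns are skipped.
function appendVoiceTurn(
  messages: ChatMessage[],
  event: ConversationTextEvent,
): ChatMessage[] {
  const content = event.content.trim();
  if (content.length === 0) return messages;
  return messages.concat([{ role: event.role, content }]);
}
```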
The Bug That Made This Real
The first working version was not done.
The agent greeted out loud. Great. Then later turns came back as text only. Not great.
At first it looked like a model problem, or maybe a Deepgram problem, or maybe one of those haunted browser-audio bugs that make you question your career choices.
It was simpler than that. I had tied “ready for audio” too tightly to one event. The client needed to unlock playback on more than one valid signal:
case "AgentThinking":
  allowNextAgentAudio();
  setStatus("thinking");
  break;
case "AgentStartedSpeaking":
  stopPlayback(false);
  setStatus("speaking");
  break;
case "ConversationText":
  if (role === "assistant") allowNextAgentAudio();
  appendConversationText(role, content || "");
  break;
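The handler above does not show what allowNextAgentAudio does internally. One hypothetical way to model it is a small gate that several events can open. The `AgentAudioGate` class is an illustration only, and closing the gate on UserStartedSpeaking is an assumption about barge-in handling, not something taken from RecallMEM.

```typescript
type AgentEventType =
  | "AgentThinking"
  | "AgentStartedSpeaking"
  | "ConversationText"
  | "UserStartedSpeaking";

// Hypothetical playback gate: more than one signal may open it.
class AgentAudioGate {
  private open = false;

  onEvent(type: AgentEventType, role?: "user" | "assistant"): void {
    if (type === "AgentThinking") {
      this.open = true;
    } else if (type === "ConversationText" && role === "assistant") {
      this.open = true;
    } else if (type === "UserStartedSpeaking") {
      // Assumption: a user interruption makes queued agent audio stale.
      this.open = false;
    }
  }

  // Ask this before queueing each incoming audio chunk for playback.
  shouldPlay(): boolean {
    return this.open;
  }
}
```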
The fix was small. The lesson was not.
A hello-world voice bot can pretend there is one clean path through the system. A real voice product has interruptions, stale audio, overlapping chunks, dead air during tool calls, reconnects, sample-rate mismatches, and people trying to use it in rooms that are not silent recording studios.
That is why the mute button mattered too. At first, mute felt like a nice extra. Then it became obvious that a voice agent you can’t control in a loud room is not a product. It is a demo waiting to embarrass you.
The Pattern
I would use this architecture when an app already has a text AI experience and voice needs access to the same tools, memory, transcripts, settings, or user context.
I would not use it for a throwaway voice bot. If all you need is “talk to a bot once,” keep it simple. Record, transcribe, respond, speak. Done.
But once the app has state, voice has to join that state.
That is the lesson here. Not “use this WebSocket.” Not “here is the right sample rate.” Those details matter, but they are not the argument.
The argument is that voice is not a layer you sprinkle over an AI app after the product is already built. Voice is another interface into the same product system. If you treat it like a separate thing, users will find the split immediately.
They will ask the voice agent about something the text app knows. They will expect the transcript to save. They will expect memory to update. They will expect settings to carry over. They will not care that your demo worked.
Deepgram gives you the realtime voice loop. That is the part you should not want to rebuild.
Your app still has to own the product.
That means memory stays in the app. Tools stay behind app boundaries. Tokens stay server-side. Transcripts save through the normal path. The model gets context, not ownership.
A voice demo can be one request wearing a microphone. A voice product has to belong to the system around it.