AI Memory Is Broken. I Know Because I Built It.

Building Speak2Me, a voice-first AI journal that actually knows your story

I was talking to Claude the other day. Not about code, not about some technical problem. I was venting. About work, about life stuff, about things I wouldn't normally say out loud to an AI. And Claude responded with something so personal, so specific to MY situation, that I stopped scrolling and just stared at it.

It referenced my daughter by name. It brought up something I'd been stressed about from a conversation three weeks ago. It connected dots between things I'd said in completely separate chats.

And I thought: that response IS a product. If I could bottle that feeling of being truly heard and remembered by an AI, people would pay for it.

So I built Speak2Me. A voice-first AI journal companion. You talk to it like a friend, and it actually remembers your story. Not generic responses like "that sounds frustrating." Real, personal responses that reference your life, your people, your patterns.

It took me about two hours to build the first version. And then it took me the rest of the week to make it actually work. Because here's the thing nobody tells you about AI memory: it's really, REALLY hard to get right.

The Promise vs. The Reality

The idea was simple: you open the app and it just gets you. It remembers your wife's name, asks about that job stress you mentioned last week, and checks if the baby is finally sleeping through the night.

I wired everything up. Hume EVI for the voice (more on the voice echo nightmare later), Mem0 for long-term memory, TiDB for the database, Claude as the brain. Deployed it on Vercel. Sent the link to a few people. Felt pretty good about myself.

Then I used it for real. Like, actually sat down and talked to it about my day. Told it personal stuff. My income. My family. My goals for the next year.

Next session, I opened it up expecting this deeply personal experience.

It had no idea who I was.

Zero context. Completely blank. Like we'd never spoken. The entire product promise, the thing that makes this different from every other AI journal, was broken.

When Your Memory Layer Forgets

I was using Mem0 for long-term memory. If you haven't heard of it, Mem0 is an open-source memory framework that's blown up on GitHub. Like 40,000+ stars. The idea is great: you feed it conversations, it extracts important facts, and you can recall those facts later. Companies are building on it. VCs are funding it.

So I told the AI during a conversation: "I make $165,000 base salary with a $22,000 bonus." (Numbers changed for privacy, but the point stands.)

I checked what Mem0 actually stored from that conversation.

It extracted: "User wants to discuss their income, stating it was previously shared."

Read that again. I gave it EXACT NUMBERS. My salary. My bonus. Specific financial details that matter to me. And Mem0's internal model compressed that into a vague sentence about wanting to discuss income. It threw away the actual data.

This isn't a bug in Mem0's design. It's a limitation of how memory extraction works. Mem0 uses a smaller language model internally (GPT-4o-mini) to decide what's worth remembering. And smaller models are aggressive summarizers. They capture the gist and drop the specifics. For casual chatbot memory, that's probably fine. For a product where remembering someone's exact life details IS the value proposition, it's a dealbreaker.

I ran more tests. Told it about my family, my career plans, specific names and dates. Some things it captured. Others it mangled or skipped entirely. There was no way to predict what it would keep and what it would lose, because I don't control the extraction model. It's a black box.

If the memory layer is the product, I can't outsource it to someone else's black box.

Who is Lily?

While I was debugging the Mem0 issue, I made another mistake that could've been way worse.

To save money, I was using GPT-4o-mini to synthesize user profiles. The idea was to take all the conversations and generate a document that captures who the user is, what they care about, who's important in their life. That profile gets injected into every future conversation so the AI has context.

I ran the synthesis on my test conversations. Read the output.

It said my daughter's name was "Lily" and my partner was "Sarah."

My daughter's name is Cristel. My wife's name is Glenda.

GPT-4o-mini just... made up names. It had conversations where those names were never mentioned, and instead of writing "not yet mentioned," it fabricated plausible-sounding names and presented them as facts. Imagine opening your personal journal companion and hearing it say "How's Lily doing?" when your daughter's name is Cristel. That's not just a bug. That's a trust-destroying moment you can never recover from.

I immediately switched to Claude Haiku 4.5 for profile synthesis and added strict instructions: "NEVER invent, guess, or infer names, numbers, locations, or details that are not directly stated in the conversations. If a name or detail was not explicitly mentioned, write 'not yet mentioned' instead."

Haiku respects those constraints. It costs more. I don't care. Model choice for synthesis tasks isn't a cost decision. It's a trust decision. One hallucinated family member name and your user is gone forever.
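The guardrail itself is just prompt text. Here's a simplified sketch of how I assemble the synthesis prompt — the wording is paraphrased and `buildProfilePrompt` is illustrative, not my production code:

```typescript
// Illustrative sketch of the profile-synthesis prompt (simplified).
const ANTI_HALLUCINATION_RULES =
  'NEVER invent, guess, or infer names, numbers, locations, or details ' +
  'that are not directly stated in the conversations. If a name or detail ' +
  'was not explicitly mentioned, write "not yet mentioned" instead.';

function buildProfilePrompt(transcripts: string[]): string {
  return [
    "You are synthesizing a user profile from journal transcripts.",
    "Cover: who they are, their work, the important people in their life, " +
      "current stressors, and their goals.",
    ANTI_HALLUCINATION_RULES,
    "--- TRANSCRIPTS ---",
    ...transcripts,
  ].join("\n\n");
}
```

The point of putting the rules in a single constant: every synthesis path uses the exact same guardrail text, so there's no version of the prompt floating around without it.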

Building My Own Memory System

After the Mem0 extraction failures and the hallucination scare, I rethought the entire memory architecture from scratch.

I needed three layers of memory, each serving a different purpose.

Layer one is the user profile. After every conversation, Claude Haiku 4.5 reads all past transcripts and generates a synthesized document. Who is this person? What do they do for work? Who are the important people in their life? What are they stressed about? What are their goals? This document gets injected into the system prompt for every future conversation. It's how the AI "knows" you before you say a word.

Layer two is per-exchange vector search. This is where the real breakthrough happened.

Originally, I was embedding entire conversation transcripts as single vectors. So a 20-minute conversation where I talked about my salary, then my weekend plans, then my sister's wedding, all became one vector. One point in mathematical space that represented the average of all those topics mashed together.

When I later asked "what did I say about my salary?" the search would find that conversation, sure. But it would also pull up every other long conversation because they all had similar blended vectors. The signal was diluted.

The fix was chunking. Instead of one vector per conversation, I split every conversation into individual exchanges. One user message plus the AI's response equals one chunk. Each chunk gets its own embedding vector. Now when I search for "salary," it finds the EXACT exchange where I discussed my salary. Not the whole conversation. The exact moment.

It's the difference between searching a book by its title versus having every single page indexed individually. The recall quality improvement was massive.
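The chunking itself is only a few lines. A simplified sketch — the real pipeline also carries entry IDs and timestamps, and the `Message` shape here is illustrative:

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Pair each user message with the assistant reply that follows it.
// Each pair becomes one chunk, and each chunk gets its own embedding.
function chunkExchanges(messages: Message[]): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].role !== "user") continue;
    const reply = messages[i + 1];
    const replyText =
      reply?.role === "assistant" ? `\nAI: ${reply.content}` : "";
    chunks.push(`User: ${messages[i].content}${replyText}`);
  }
  return chunks;
}
```

Including the AI's reply in the chunk matters: the response often names the thing the user only alluded to, which makes the chunk's embedding much easier to hit later.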

I'm using OpenAI's text-embedding-3-large model (3072 dimensions) and storing the vectors in TiDB, which supports vector search natively. When the AI needs to recall something during a live conversation, it searches these chunks using cosine distance. The cost is basically nothing. Less than ten cents per user per year for embeddings.
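For reference, cosine distance is just 1 minus the cosine similarity of the two vectors — the same math TiDB runs server-side, written out:

```typescript
// Cosine distance = 1 - (a·b) / (|a||b|).
// 0 means same direction, 1 means orthogonal, 2 means opposite.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```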

Layer three is the raw transcripts. Every word, stored unmodified. This is the ground truth that never gets summarized, compressed, or distorted by a model. If the profile synthesis misses something or the vector search returns a weird result, the raw data is always there.

I eventually ripped Mem0 out completely. Not because it's bad software. It's not. But once the three-layer system was working, Mem0 wasn't adding anything. It was just another dependency sitting between me and my data. The whole point of building my own memory layer was control. Keeping Mem0 around "just in case" defeated the purpose.

Why I Skipped Pinecone for TiDB

I mentioned I'm storing vectors in TiDB. That choice deserves its own section.

I work at PingCAP. Built this on paternity leave though, paying for my own usage. Could've used Supabase plus Pinecone. Didn't. Here's why.

Every RAG tutorial tells you the same thing: Postgres for your data, Pinecone for your vectors. Two databases. Two bills. Sync jobs between them.

Here's the query that runs when the AI needs to recall a memory:

SELECT
  e.title,
  e.top_emotions,
  c.chunk_text,
  VEC_COSINE_DISTANCE(c.embedding, ?) as relevance
FROM s2m_transcript_chunks c
JOIN s2m_journal_entries e ON c.entry_id = e.id
WHERE c.user_id = ?
  AND e.created_at > DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY relevance
LIMIT 5

Vector search. Date filter. User filter. JOIN to get the full context. One query. One network hop.

With Pinecone, that same operation looks like: call Pinecone with the vector, get back chunk IDs, call Postgres with those IDs, join the results in your application code. Two round trips. Two failure points. And you're doing the join in JavaScript instead of letting the database optimizer handle it.

But the real win is pre-filtering.

Vector search is expensive. Comparing your query vector against millions of stored vectors takes real compute. TiDB filters by user_id and date FIRST using regular indexes. Fast. Cheap. Then it runs the vector search on that smaller subset. Many dedicated vector stores work the other way around: search the whole index first, then filter out results that don't match your metadata, or pay extra to filter during the approximate search. At scale, that difference matters.

The other thing that matters for AI agents: strong consistency.

During a conversation, the AI extracts a fact from what you said, stores it, and might need to recall it thirty seconds later in the same session. With a Postgres plus Pinecone setup, you're dealing with sync lag. Write to Postgres, trigger a job to update Pinecone, hope it finishes before the next recall. Eventual consistency headaches.

With TiDB, I write the embedding and it's immediately searchable. Same transaction. No lag. No sync jobs. No "read your own writes" bugs.

One database. Vectors next to the data they describe. Ship faster, debug easier.

The Latency Problem

I asked the AI about a frustration I'd shared earlier. It started talking immediately. Confident. Specific. And confidently wrong.

It hallucinated a restaurant I'd never been to. Made up details about a conversation that never happened. I sat there knowing it was fake because I never said any of that. Then, 10-20 seconds later, it corrected itself. "Oh wait, that's what you were talking about..."

That moment ruins everything.

The whole product promise is an AI that remembers you. But when it takes 30 seconds to think, when it guesses wrong first and corrects itself after, when you can feel it searching a database... the magic dies. You're not talking to something that knows you. You're talking to a computer that has to look you up.

I had a recall_memory tool wired up through Hume. It worked. The vector search found the right results. But Hume's voice AI is built for speed. It starts generating a response immediately, then injects the tool results when they arrive. So the AI would confidently say "Was it The Alchemist's Nook?" and then three seconds later go "Actually no, China Mama."

That's worse than not remembering at all.

I was in the shower talking through the problem out loud. Giving a talk to nobody. And it hit me: what if the AI already knows everything before I say a word?

So now when a session ends, Claude Haiku extracts the key facts synchronously. Takes about 500ms. Not just names and dates. The kind of stuff a friend would remember: "Had sushi at China Mama," "Stressed about the LangChain interview," "Wants to try that coffee shop." I call these quick_facts and save them directly on the journal entry.

When you open the app, before you even tap "Start Talking," the dashboard fetches your profile summary and the last 20 entries worth of quick_facts in the background. By the time you speak, the AI has everything in context. No tool calls. No waiting. No guessing.

          Session End    Session Start    Memory Recall
Before    Instant        ~2s              5-10s (tool call)
After     +500ms         Instant          Rarely needed

The recall_memory tool still exists for older stuff. "What did I say three months ago about..." But for anything recent, the AI just knows.

It costs more tokens. Way more. But the first time the AI remembered something instantly? No pause, no hallucination, no correction. Just... it knew.

That's the product.

The Voice Echo From Hell

Speak2Me is voice-first. You talk to it. It talks back. The entire experience is a real-time conversation powered by Hume EVI, which handles speech-to-text, emotion detection, LLM routing, and text-to-speech all in a single WebSocket connection.

When it works, it's magical. You're literally having a conversation with an AI that can hear the TONE of your voice, not just the words, and respond with appropriate emotion. Hume detects 48+ dimensions of vocal expression. So when you sound stressed, the AI doesn't just hear "I had a rough day." It hears the tension in your voice and adjusts its response accordingly.

But the echo.

When the AI speaks, its voice comes out of your phone's speaker. Your phone's microphone picks up that audio. The AI hears its own voice, thinks it's you talking, transcribes its own speech, and responds to itself.

Infinite. Feedback. Loop. The AI talking to itself, generating responses to its own words, forever.

I thought this would be a one-line fix. It wasn't.

On a native iOS app, the operating system has built-in acoustic echo cancellation at the hardware level. The OS knows what audio is coming out of the speaker and mathematically subtracts it from the mic input. It just works.

On a web app running in a mobile browser? You're at the mercy of whatever the browser implements. Chrome on desktop has decent echo cancellation. Mobile Safari is hit or miss. Some Android browsers barely try.

My first attempt was muting the microphone while the AI is speaking and unmuting when it stops. That technically works, but it kills the most natural part of voice conversation: the ability to interrupt. If you want to say "wait, actually, let me back up" while the AI is mid-sentence, you can't. The mic is muted.

What I ended up doing was using the browser's built-in audio constraints:

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});

On desktop, this works great. The browser's echo canceller separates speaker output from mic input in software. On mobile, it's... acceptable at lower volumes. I added a volume slider that shows up during sessions, with a recommended level of 40% for non-headphone use.

The real answer is headphones. Or a native iOS app where you get system-level echo cancellation. That's coming.

None of this is documented anywhere. I couldn't find a single blog post, Stack Overflow answer, or tutorial that addressed real-time voice AI echo in a web app. Everyone building voice apps is either doing it natively or doing simple one-shot speech-to-text where echo doesn't matter. Real-time, back-and-forth voice conversation on the web is still basically uncharted territory.

What's Next

Speak2Me is live at speak2me.io. But before anything else, encryption. Users are sharing their most personal thoughts. Journal transcripts and AI responses will be encrypted at rest in the database. That data needs to be protected like it matters, because it does.

After that, native iOS. The echo problem is fundamentally a web limitation and iOS gives you hardware-level acoustic echo cancellation that actually works. Plus push notifications, background audio, biometrics. Important for a personal journal.

And honestly? I need more people using this every day so I can see what the AI gets right and what it misses. The memory system will keep improving, but only with real conversation data flowing through it.

If you're a developer building anything with AI memory, I hope the failures I documented here save you some time. And if you want to try Speak2Me, go talk to it. Tell it something real. Then come back tomorrow and see if it remembers.

When it asks about Cristel instead of Lily? That's the product.


Chris Dabatos - Developer Advocate and Engineer


Developer Advocate and content creator based in Las Vegas. He builds things with AI and writes about what breaks.
