Voice AI + Memory

Speak2Me

Name: Speak2Me
Author: Chris Dabatos

Voice-first AI journal that actually LISTENS, remembers everything you've ever told it, and talks back like someone who knows you.


What It Does

Open the app, tap the mic, and just... talk. About your day, your stress, your wins, whatever. The AI listens to your voice, picks up on how you're ACTUALLY feeling from your tone, and responds like a close friend who's been there for every conversation you've ever had.

Here's the thing that makes this different from every other AI chat out there: most of them start fresh every session. They don't know you. Speak2Me has a three-tier memory system that pre-loads your entire history before you even click start. It knows your partner's name, your goals, your recurring stress patterns, and that interview you mentioned last Tuesday. Zero recall delay. First message, full context.

Features

Voice Conversation

Real-time voice chat powered by Hume EVI (Empathic Voice Interface). It doesn't just read your words; it listens to HOW you say them. If your voice sounds sad but you say "I'm fine," it notices. There's also a text input fallback for when you can't talk out loud, plus mute/unmute and echo cancellation so it actually works in real environments.

Persistent Memory

This is the core of the whole thing. Three tiers working together:

Tier 1: Profile Summary. A structured summary of everything the AI knows about you. Auto-generated after each conversation by synthesizing all your facts into labeled sections.

IDENTITY: Alex, 34, Austin TX
FAMILY: Partner Jamie, Son Lucas (born Mar 2025)
WORK: Software engineer, side projects in AI
FINANCES: Maxing out Roth IRA, saving for house down payment
HEALTH: Back pain from desk setup, started physical therapy
CURRENT: Interviewing at two companies this month
PATTERNS: Anxiety spikes before interviews, cooking is a stress reliever

Tier 2: Quick Facts. Individual facts extracted from every single conversation. Stored in a dedicated table with categories (identity, family, work, finance, health, interest, project, social). Identity facts like names and birthdays are pinned and NEVER drop off.

CREATE TABLE s2m_user_facts (
  id VARCHAR(36) PRIMARY KEY,
  user_id VARCHAR(255) NOT NULL,
  fact_text TEXT NOT NULL,
  category VARCHAR(20) NOT NULL,
  source_entry_id VARCHAR(36),
  is_active BOOLEAN DEFAULT TRUE,
  superseded_by VARCHAR(36) NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
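
As a rough illustration of how pinned identity facts might interact with a prompt budget, here is a minimal TypeScript sketch. The `UserFact` shape mirrors the table columns above, but `selectFactsForPrompt` and its selection rule (all active identity facts first, then the newest remaining facts) are assumptions made for this example, not the app's actual logic.

```typescript
// Row shape mirroring the s2m_user_facts columns (camelCased for TypeScript).
interface UserFact {
  id: string;
  userId: string;
  factText: string;
  category: string;
  isActive: boolean;
  supersededBy: string | null;
  createdAt: Date;
}

// Hypothetical helper: choose which facts go into the prompt.
// Assumption: "pinned" means every active identity fact is always kept,
// and any remaining slots go to the newest active non-identity facts.
function selectFactsForPrompt(facts: UserFact[], limit: number): UserFact[] {
  const active = facts.filter((f) => f.isActive && f.supersededBy === null);
  const pinned = active.filter((f) => f.category === "identity");
  const rest = active
    .filter((f) => f.category !== "identity")
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  const remaining = Math.max(0, limit - pinned.length);
  return [...pinned, ...rest.slice(0, remaining)];
}
```

With this rule, an identity fact from two years ago still makes the cut even when the budget is tight, while older work or health facts are the first to drop.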

Tier 3: Vector Search. Every conversation is chunked, embedded with OpenAI text-embedding-3-large, and stored in TiDB's vector column. Ask "what was I stressed about in January?" and it does a cosine similarity search across all your past conversations. Real answers from your own words.
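
To make the cosine similarity step concrete, here is a small in-memory TypeScript sketch. In the real app the ranking would presumably happen inside TiDB (its vector search exposes a `VEC_COSINE_DISTANCE` function); the `Chunk` type and `topKChunks` helper below are hypothetical and only illustrate the math.

```typescript
// Hypothetical chunk shape: a piece of transcript plus its embedding vector.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks against a query embedding and keep the k best matches.
function topKChunks(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The question "what was I stressed about in January?" would be embedded with the same model as the chunks, and the top matches handed to the LLM as context.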

Emotion Tracking

Hume's prosody model detects emotions from your voice in real time. Not sentiment analysis on text. Actual vocal patterns. Emotions are stored per conversation and aggregated into weekly trends on the dashboard. The AI responds to emotional signals naturally without narrating them like a robot.

Conversation History

Full transcripts are saved, with auto-save every 5 seconds during a live session. Browse by date with lazy loading. Ask natural language questions about your journal history through conversation insights. Get "On This Day" throwbacks from past entries. Copy and download any transcript.

Dashboard

Streak tracking, weekly emotion trends, smart memory carousel with upcoming date reminders, and milestone progress cards. Everything you need to see your patterns over time.

Dynamic Timezone

The browser detects your timezone automatically on page load and stores it in your profile. Fly to Seoul, and the AI says "good morning" at Seoul time. No config needed.
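
A minimal TypeScript sketch of the idea: `Intl.DateTimeFormat().resolvedOptions().timeZone` is the standard way to read the browser's IANA timezone. The `greetingFor` helper and its hour cutoffs (noon and 6 pm) are assumptions for illustration, not the app's actual greeting logic.

```typescript
// Read the IANA timezone name (e.g. "Asia/Seoul") from the runtime.
function detectTimeZone(): string {
  return Intl.DateTimeFormat().resolvedOptions().timeZone;
}

// Hypothetical helper: pick a greeting from the local hour in a given timezone.
function greetingFor(now: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-US", {
    hour: "numeric",
    hour12: false,
    timeZone,
  }).formatToParts(now);
  const hourPart = parts.find((p) => p.type === "hour");
  // Some engines render midnight as "24" in 24-hour mode, so wrap with % 24.
  const hour = Number(hourPart ? hourPart.value : "0") % 24;
  if (hour < 12) return "good morning";
  if (hour < 18) return "good afternoon";
  return "good evening";
}
```

For example, midnight UTC is 9 am in Seoul, so `greetingFor(new Date("2025-01-01T00:00:00Z"), "Asia/Seoul")` lands in the morning branch regardless of where the server runs.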

Stack

Component	Technology
Voice	Hume EVI (WebSocket streaming, emotion detection)
LLM	Claude Sonnet 4.6 (via Custom Language Model endpoint)
Database	TiDB Serverless (vector search + relational data)
Embeddings	OpenAI text-embedding-3-large
Background Jobs	Inngest (profile synthesis, cache rebuilding, transcript processing)
Auth	NextAuth.js (Google OAuth)
Frontend	Next.js 16, React, Tailwind CSS
Hosting	Vercel

Architecture

Voice Session

Mic capture
Hume SDK
Live transcript
Auto-save
Text fallback

Dashboard

Streak tracking
Emotion trends
Memory carousel
Milestones
On This Day

History

Browse by date
Transcript detail
Conversation insights

Hume EVI

WebSocket voice streaming
Prosody emotion detection
Turn detection
Calls CLM endpoint for every message

/api/hume/clm

Build system prompt w/ full context
Stream Claude response

/api/journal

Save session
Extract facts
Auto-save
Embed chunks

/api/memory

Cached profile
Quick facts
Recent context

Claude API

Sonnet 4.6 (conversation)
Haiku (fact extraction)
Opus (insights)

Inngest

Profile synthesis
Cache rebuild
Transcript processing

TiDB

User profiles
Journal entries
Facts table
Transcript chunks (vectors)
Voice profiles

How Memory Works

Most AI memory works like this: user says something, AI calls a memory API, waits 5-10 seconds, then responds. It's slow and it FEELS slow.

Speak2Me does it differently. When you open the app, your profile, facts, and recent context are all pre-fetched in the background before you click start. By the time the session begins, everything is already loaded. The AI has your full context on the very first message. Vector search only fires when you ask about something deep in your history.

User opens app    → Profile pre-fetched (background)
User clicks Start → Data already cached (zero wait)
User speaks       → CLM has full context on first turn
User asks about last month → Vector search (only when needed)
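
The pre-fetch flow above could be sketched like this in TypeScript. `ContextPrefetcher`, `MemoryContext`, and the injected `load` function are hypothetical names for this example; in the app, `load` would presumably wrap a call to `/api/memory/profile`.

```typescript
// Hypothetical shape of what gets pre-loaded before a session starts.
interface MemoryContext {
  profile: string;
  facts: string[];
  recent: string[];
}

class ContextPrefetcher {
  private pending: Promise<MemoryContext> | null = null;
  private readonly load: () => Promise<MemoryContext>;

  constructor(load: () => Promise<MemoryContext>) {
    this.load = load;
  }

  // Call when the app opens: kicks off the fetch without blocking the UI.
  prefetch(): void {
    if (this.pending === null) {
      this.pending = this.load().catch((err) => {
        this.pending = null; // allow a retry if the fetch failed
        throw err;
      });
    }
  }

  // Call when the user clicks Start: reuses the in-flight or finished fetch.
  get(): Promise<MemoryContext> {
    this.prefetch();
    return this.pending as Promise<MemoryContext>;
  }
}
```

The key design choice is that `get()` never starts a second request. If the background fetch already finished, Start resolves immediately; if not, it waits only for the remainder.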

The CLM (Custom Language Model) Pattern

Hume EVI handles voice streaming and emotion detection, but it doesn't know your life story. The CLM endpoint bridges that gap.

When you speak, Hume transcribes your audio and detects emotions from your vocal prosody. Then it calls our CLM endpoint with the transcript and emotion scores. The endpoint runs 6 parallel queries (profile, vector search, recent chunks, active facts, last conversation timestamp, timezone), builds a system prompt with all that context plus the emotion data, and streams Claude's response back to Hume. Hume converts Claude's text to speech and plays it back.

Every single turn goes through this loop. The AI rebuilds its full context on every message so it never goes stale mid-conversation.
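
Here is a hedged TypeScript sketch of the per-turn context build. The `ContextSource` interface, its method names, and the prompt wording are all placeholders invented for this example; only the overall shape (six lookups in parallel, then one prompt) comes from the description above.

```typescript
// Hypothetical data-access layer: each method stands in for one of the six queries.
interface ContextSource {
  getProfile(userId: string): Promise<string>;
  searchChunks(userId: string, query: string): Promise<string[]>;
  getRecentChunks(userId: string): Promise<string[]>;
  getActiveFacts(userId: string): Promise<string[]>;
  getLastConversationAt(userId: string): Promise<Date | null>;
  getTimeZone(userId: string): Promise<string>;
}

interface TurnContext {
  profile: string;
  relevantChunks: string[];
  recentChunks: string[];
  activeFacts: string[];
  lastConversationAt: Date | null;
  timeZone: string;
}

// Fire all six lookups at once: latency is roughly the slowest query, not the sum.
async function loadTurnContext(
  src: ContextSource,
  userId: string,
  transcript: string
): Promise<TurnContext> {
  const [profile, relevantChunks, recentChunks, activeFacts, lastConversationAt, timeZone] =
    await Promise.all([
      src.getProfile(userId),
      src.searchChunks(userId, transcript),
      src.getRecentChunks(userId),
      src.getActiveFacts(userId),
      src.getLastConversationAt(userId),
      src.getTimeZone(userId),
    ]);
  return { profile, relevantChunks, recentChunks, activeFacts, lastConversationAt, timeZone };
}

// Assemble the system prompt. The section labels and wording are placeholders.
function buildSystemPrompt(ctx: TurnContext, emotionLine: string): string {
  const last = ctx.lastConversationAt ? ctx.lastConversationAt.toISOString() : "none";
  return [
    `PROFILE:\n${ctx.profile}`,
    `FACTS:\n${ctx.activeFacts.join("\n")}`,
    `RECENT:\n${ctx.recentChunks.join("\n")}`,
    `RELEVANT HISTORY:\n${ctx.relevantChunks.join("\n")}`,
    `LAST CONVERSATION: ${last}`,
    `USER TIMEZONE: ${ctx.timeZone}`,
    emotionLine,
  ].join("\n\n");
}
```

The resulting string would be sent as the system prompt alongside the transcript, with Claude's streamed reply relayed back to Hume for speech synthesis.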

Fact Extraction

After each conversation ends, facts are extracted synchronously in about 500ms using Claude Haiku. It reads the transcript and pulls out:

Relationships: "Partner's name is Jamie"
Life events: "Son Lucas born March 2025"
Experiences: "Had sushi downtown with family"
Emotions: "Stressed about upcoming interview"
Intentions: "Wants to max out 401k this year"
Casual mentions: "Wants to try the new ramen place on 6th Street"

Facts are categorized, deduplicated against existing facts (70% word overlap threshold), and older versions get superseded by newer ones. A garbage filter rejects meta-observations like "user tested whether AI remembers" or "AI responded with." Because that's not a fact about YOU. That's noise.
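
The dedup and garbage filter could look something like the TypeScript sketch below. The exact overlap formula is not specified above, so this assumes "shared words divided by the word count of the shorter fact"; the regex patterns and the `classifyFact` helper are likewise illustrative guesses, and treating a near-duplicate as a newer version that supersedes the old one is an assumption.

```typescript
// Lowercase a fact and split it into a set of unique words.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9']+/g) ?? []);
}

// Assumption: overlap = shared words / number of words in the shorter fact.
function wordOverlap(a: string, b: string): number {
  const wordsA = tokenize(a);
  const wordsB = tokenize(b);
  const smaller = Math.min(wordsA.size, wordsB.size);
  if (smaller === 0) return 0;
  let shared = 0;
  wordsA.forEach((w) => {
    if (wordsB.has(w)) shared++;
  });
  return shared / smaller;
}

// Hypothetical patterns for meta-observations about the AI rather than the user.
const GARBAGE_PATTERNS: RegExp[] = [/\bai responded with\b/i, /\buser tested whether\b/i];

function isGarbage(fact: string): boolean {
  return GARBAGE_PATTERNS.some((pattern) => pattern.test(fact));
}

type FactDecision =
  | { action: "reject" }
  | { action: "insert" }
  | { action: "supersede"; existingIndex: number };

// Decide what to do with a freshly extracted fact given the facts already stored.
function classifyFact(candidate: string, existing: string[], threshold: number = 0.7): FactDecision {
  if (isGarbage(candidate)) return { action: "reject" };
  const match = existing.findIndex((fact) => wordOverlap(candidate, fact) >= threshold);
  return match === -1 ? { action: "insert" } : { action: "supersede", existingIndex: match };
}
```

A "supersede" result would map to setting `is_active` to false and filling `superseded_by` on the old row, so history is kept but only the newest version reaches the prompt.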

Emotion Detection

Hume's prosody model analyzes vocal patterns and returns confidence scores for 48 emotions on every utterance. The top 3 are passed to Claude as context:

The user's voice shows: Sadness: 45%, Anxiety: 32%, Determination: 28%

The system prompt tells Claude to respond to the emotion without narrating it. If someone sounds sad but says they're fine, the AI might say "You say you're fine but I can hear it in your voice. What's really going on?" instead of "I detect that you're feeling sad." Because nobody talks like that.
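
Picking the top 3 emotions and formatting the context line could be sketched as follows in TypeScript. The `EmotionScores` shape (emotion name mapped to a 0 to 1 confidence) is an assumption for this example and may not match the payload Hume actually sends.

```typescript
// Assumed shape: emotion name -> confidence score between 0 and 1.
type EmotionScores = Record<string, number>;

// Sort by score, highest first, and keep the top n.
function topEmotions(scores: EmotionScores, n: number = 3): Array<[string, number]> {
  return Object.entries(scores)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

// Build the one-line emotion summary that gets passed to Claude as context.
function formatEmotionContext(scores: EmotionScores): string {
  const parts = topEmotions(scores).map(
    ([name, score]) => `${name}: ${Math.round(score * 100)}%`
  );
  return `The user's voice shows: ${parts.join(", ")}`;
}
```

Keeping only three scores gives the model a clear emotional signal without burying it in dozens of low-confidence numbers.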

API Endpoints

Endpoint	Method	Description
`/api/hume/clm/chat/completions`	POST	CLM endpoint (called by Hume, not browser)
`/api/hume/token`	GET	Get Hume access token for WebSocket
`/api/journal`	GET/POST	List entries / Save completed session
`/api/journal/autosave`	POST	Live transcript save (every 5s)
`/api/journal/insights`	POST	Ask questions about your history
`/api/journal/emotions`	GET	Emotion trend data
`/api/memory/profile`	GET	Cached profile + facts + recent context
`/api/user/timezone`	POST	Sync browser timezone to DB

File Structure

speak2me/
├── app/
│   ├── layout.tsx                    # Root layout with auth, timezone sync
│   ├── page.tsx                      # Dashboard (streaks, emotions, memory)
│   ├── journal/page.tsx              # Voice session page
│   ├── history/page.tsx              # Conversation history by date
│   ├── entry/[id]/page.tsx           # Individual entry detail
│   ├── settings/page.tsx             # User settings
│   └── api/
│       ├── hume/
│       │   ├── clm/chat/completions/ # Custom Language Model endpoint
│       │   └── token/                # Hume access token
│       ├── journal/
│       │   ├── route.ts              # Save/list entries + fact extraction
│       │   ├── autosave/             # Live transcript auto-save
│       │   ├── insights/             # Natural language history queries
│       │   ├── emotions/             # Emotion trend data
│       │   └── stats/                # Usage statistics
│       ├── memory/
│       │   └── profile/              # Cached profile + facts + context
│       └── user/
│           └── timezone/             # Browser timezone sync
├── components/
│   ├── voice/
│   │   └── voice-session.tsx         # Main voice UI
│   ├── dashboard/
│   │   ├── memory-callback.tsx       # Smart memory carousel
│   │   ├── emotion-trends.tsx        # Weekly mood chart
│   │   └── streak-display.tsx        # Streak counter
│   ├── timezone-sync.tsx             # Auto-syncs browser TZ to DB
│   └── ...
├── lib/
│   ├── prompts.ts                    # System prompt builder + fact selection
│   ├── memory.ts                     # Profile synthesis
│   ├── facts.ts                      # Fact storage, retrieval, dedup
│   ├── embeddings.ts                 # OpenAI embedding generation
│   ├── db.ts                         # TiDB connection
│   └── inngest/
│       └── functions.ts              # Background jobs (profile, cache, chunks)
└── scripts/
    ├── schema.sql                    # Full database schema
    ├── migrate-facts.ts              # Fact migration tool
    └── rebuild-cache.ts              # Cache rebuilder

Privacy

All data is scoped per user, and access is authenticated via Google OAuth. Transcripts and AI responses are stored in TiDB (encryption planned). Emotion data is stored alongside transcripts and never shared. There are no third-party memory APIs involved. Voice audio is processed by Hume in real time and not stored.

Your data stays yours. That was the whole point of building this.

Known Limitations

Single user right now (multi-user auth in progress)
Hume's turn detection is a black box with no configurable thresholds
Emotion badges removed from live UI to reduce the "talking to a machine" feeling
Voice enrollment for speaker filtering exists but isn't wired into live sessions yet

Code Access

The repo is private. If you're a recruiter or hiring manager and want to walk through the code or architecture, I'm happy to do a live session. Just reach out.

Try it: speak2me.io

Chris Dabatos

Developer Advocate building AI-powered apps