Speak2Me
Voice-first AI journal that actually LISTENS, remembers everything you've ever told it, and talks back like someone who knows you.
What It Does
Open the app, tap the mic, and just... talk. About your day, your stress, your wins, whatever. The AI listens to your voice, picks up on how you're ACTUALLY feeling from your tone, and responds like a close friend who's been there for every conversation you've ever had.
Here's the thing that makes this different from every other AI chat out there: most of them start fresh every session. They don't know you. Speak2Me has a three-tier memory system that pre-loads your entire history before you even click start. It knows your partner's name, your goals, your recurring stress patterns, and that interview you mentioned last Tuesday. Zero recall delay. First message, full context.
Features
Voice Conversation
Real-time voice chat powered by Hume EVI (Empathic Voice Interface). It doesn't just read your words; it listens to HOW you say them. If your voice sounds sad but you say "I'm fine," it notices. There's also a text input fallback for when you can't talk out loud, plus mute/unmute and echo cancellation so it actually works in real environments.
Persistent Memory
This is the core of the whole thing. Three tiers working together:
Tier 1: Profile Summary. A structured summary of everything the AI knows about you. Auto-generated after each conversation by synthesizing all your facts into labeled sections.
IDENTITY: Alex, 34, Austin TX
FAMILY: Partner Jamie, Son Lucas (born Mar 2025)
WORK: Software engineer, side projects in AI
FINANCES: Maxing out Roth IRA, saving for house down payment
HEALTH: Back pain from desk setup, started physical therapy
CURRENT: Interviewing at two companies this month
PATTERNS: Anxiety spikes before interviews, cooking is a stress reliever
Tier 2: Quick Facts. Individual facts extracted from every single conversation. Stored in a dedicated table with categories (identity, family, work, finance, health, interest, project, social). Identity facts like names and birthdays are pinned and NEVER drop off.
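CREATE TABLE s2m_user_facts (
  id              VARCHAR(36)  PRIMARY KEY,
  user_id         VARCHAR(255) NOT NULL,
  fact_text       TEXT         NOT NULL,
  category        VARCHAR(20)  NOT NULL,      -- identity, family, work, finance, ...
  source_entry_id VARCHAR(36),                -- journal entry the fact was extracted from
  is_active       BOOLEAN      DEFAULT TRUE,  -- FALSE once a newer fact replaces this one
  superseded_by   VARCHAR(36)  NULL,          -- id of the replacing fact, if any
  created_at      TIMESTAMP    DEFAULT CURRENT_TIMESTAMP
);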
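To make the pinning and superseding rules concrete, here is a minimal in-memory sketch in TypeScript. The `UserFact` shape mirrors the table columns, but `supersede`, `selectFacts`, and the "newest first" ordering for non-identity facts are assumptions for illustration, not the app's actual code.

```typescript
type FactCategory =
  | "identity" | "family" | "work" | "finance"
  | "health" | "interest" | "project" | "social";

// Mirrors the s2m_user_facts columns (camelCased); hypothetical type.
interface UserFact {
  id: string;
  userId: string;
  factText: string;
  category: FactCategory;
  isActive: boolean;
  supersededBy: string | null;
  createdAt: Date;
}

// Mark the old fact inactive, point it at its replacement, and add the new fact.
function supersede(facts: UserFact[], oldId: string, replacement: UserFact): UserFact[] {
  const updated = facts.map((f) =>
    f.id === oldId ? { ...f, isActive: false, supersededBy: replacement.id } : f
  );
  return [...updated, replacement];
}

// Pick facts for a prompt: active identity facts are always kept (pinned),
// and the remaining slots go to the newest active facts. The ordering rule
// is an assumption.
function selectFacts(facts: UserFact[], limit: number): UserFact[] {
  const active = facts.filter((f) => f.isActive);
  const pinned = active.filter((f) => f.category === "identity");
  const rest = active
    .filter((f) => f.category !== "identity")
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  return [...pinned, ...rest.slice(0, Math.max(0, limit - pinned.length))];
}
```

Keeping the old row around with `superseded_by` set, instead of deleting it, preserves a history of how a fact changed over time.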
Tier 3: Vector Search. Every conversation is chunked, embedded with OpenAI text-embedding-3-large, and stored in TiDB's vector column. Ask "what was I stressed about in January?" and it does a cosine similarity search across all your past conversations. Real answers from your own words.
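For intuition, here is what a cosine similarity ranking looks like in plain TypeScript. In the app the search presumably runs inside TiDB against the vector column; this in-memory version only illustrates the math, and the `Chunk` type and `topK` helper are hypothetical names.

```typescript
// Hypothetical shape of a stored transcript chunk.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to the query embedding and keep the best k.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

A question like "what was I stressed about in January?" would be embedded with the same model as the chunks, then passed in as `query`.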
Emotion Tracking
Hume's prosody model detects emotions from your voice in real time. Not sentiment analysis on text. Actual vocal patterns. Emotions are stored per conversation and aggregated into weekly trends on the dashboard. The AI responds to emotional signals naturally without narrating them like a robot.
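One way the per-conversation scores could roll up into weekly trends is a simple mean per emotion per week. The sketch below assumes weeks start on Monday in UTC and that scores are fractions between 0 and 1; `ConversationEmotions`, `weekStart`, and `weeklyTrends` are illustrative names, not the app's own.

```typescript
// Hypothetical record: one conversation's emotion scores (0..1 per emotion).
interface ConversationEmotions {
  date: Date;
  scores: Record<string, number>;
}

// Monday (UTC) of the week containing `date`, formatted as YYYY-MM-DD.
function weekStart(date: Date): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const daysSinceMonday = (d.getUTCDay() + 6) % 7;
  d.setUTCDate(d.getUTCDate() - daysSinceMonday);
  return d.toISOString().slice(0, 10);
}

// Average each emotion over the conversations in which it appears, per week.
function weeklyTrends(entries: ConversationEmotions[]): Record<string, Record<string, number>> {
  const sums: Record<string, Record<string, { total: number; count: number }>> = {};
  for (const entry of entries) {
    const week = weekStart(entry.date);
    if (!sums[week]) sums[week] = {};
    for (const name of Object.keys(entry.scores)) {
      if (!sums[week][name]) sums[week][name] = { total: 0, count: 0 };
      sums[week][name].total += entry.scores[name];
      sums[week][name].count += 1;
    }
  }
  const trends: Record<string, Record<string, number>> = {};
  for (const week of Object.keys(sums)) {
    trends[week] = {};
    for (const name of Object.keys(sums[week])) {
      trends[week][name] = sums[week][name].total / sums[week][name].count;
    }
  }
  return trends;
}
```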
Conversation History
Full transcripts saved with auto-save every 5 seconds. Browse by date with lazy loading. Ask natural language questions about your journal history through conversation insights. "On This Day" throwbacks from past entries. Copy and download any transcript.
Dashboard
Streak tracking, weekly emotion trends, smart memory carousel with upcoming date reminders, and milestone progress cards. Everything you need to see your patterns over time.
Dynamic Timezone
The browser detects your timezone automatically on page load and stores it in your profile. Fly to Seoul, and the AI says "good morning" on Seoul time. No config needed.
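A rough sketch of both halves, detection and time-aware greeting, using the standard `Intl` API. The function names and the hour boundaries for each greeting are assumptions chosen for illustration.

```typescript
// IANA timezone reported by the runtime, e.g. "Asia/Seoul".
function detectTimezone(): string {
  return Intl.DateTimeFormat().resolvedOptions().timeZone;
}

// Hour of day (0-23) at `now` in the given IANA timezone.
function localHour(now: Date, timeZone: string): number {
  const text = new Intl.DateTimeFormat("en-US", {
    hour: "numeric",
    hour12: false,
    timeZone,
  }).format(now);
  // Some runtimes format midnight as "24", so normalize to 0-23.
  return parseInt(text, 10) % 24;
}

// Greeting boundaries are an assumption: 5-11 morning, 12-17 afternoon.
function greeting(now: Date, timeZone: string): string {
  const hour = localHour(now, timeZone);
  if (hour >= 5 && hour < 12) return "good morning";
  if (hour >= 12 && hour < 18) return "good afternoon";
  return "good evening";
}
```

On the client, `detectTimezone()` would supply the value that gets posted to the timezone endpoint; the server can then call something like `greeting(new Date(), storedTimezone)` when building the prompt.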
Stack
| Component | Technology |
|---|---|
| Voice | Hume EVI (WebSocket streaming, emotion detection) |
| LLM | Claude Sonnet 4.6 (via Custom Language Model endpoint) |
| Database | TiDB Serverless (vector search + relational data) |
| Embeddings | OpenAI text-embedding-3-large |
| Background Jobs | Inngest (profile synthesis, cache rebuilding, transcript processing) |
| Auth | NextAuth.js (Google OAuth) |
| Frontend | Next.js 16, React, Tailwind CSS |
| Hosting | Vercel |
Architecture
Voice Session
- Mic capture
- Hume SDK
- Live transcript
- Auto-save
- Text fallback
Dashboard
- Streak tracking
- Emotion trends
- Memory carousel
- Milestones
- On This Day
History
- Browse by date
- Transcript detail
- Conversation insights
Hume EVI
- WebSocket voice streaming
- Prosody emotion detection
- Turn detection
- Calls CLM endpoint for every message
/api/hume/clm
- Build system prompt w/ full context
- Stream Claude response
/api/journal
- Save session
- Extract facts
- Auto-save
- Embed chunks
/api/memory
- Cached profile
- Quick facts
- Recent context
Claude API
- Sonnet 4.6 (conversation)
- Haiku (fact extraction)
- Opus (insights)
Inngest
- Profile synthesis
- Cache rebuild
- Transcript processing
TiDB
- User profiles
- Journal entries
- Facts table
- Transcript chunks (vectors)
- Voice profiles
How Memory Works
Most AI memory works like this: user says something, AI calls a memory API, waits 5-10 seconds, then responds. It's slow and it FEELS slow.
Speak2Me does it differently. When you open the app, your profile, facts, and recent context are all pre-fetched in the background before you click start. By the time the session begins, everything is already loaded. The AI has your full context on the very first message. Vector search only fires when you ask about something deep in your history.
User opens app → Profile pre-fetched (background)
User clicks Start → Data already cached (zero wait)
User speaks → CLM has full context on first turn
User asks about last month → Vector search (only when needed)
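The flow above boils down to starting the fetch early and reusing the same promise later. Here is a minimal sketch of that idea; `ContextPrefetcher`, `MemoryContext`, and the loader signature are hypothetical, and the real app may cache differently.

```typescript
// Hypothetical bundle of everything pre-loaded for a session.
interface MemoryContext {
  profile: string;
  facts: string[];
  recent: string[];
}

type Loader = (userId: string) => Promise<MemoryContext>;

class ContextPrefetcher {
  private pending = new Map<string, Promise<MemoryContext>>();
  private load: Loader;

  constructor(load: Loader) {
    this.load = load;
  }

  // Called when the app opens: start loading, but do not wait for it.
  prefetch(userId: string): void {
    if (!this.pending.has(userId)) {
      this.pending.set(userId, this.load(userId));
    }
  }

  // Called on Start: reuse the in-flight or finished load instead of refetching.
  get(userId: string): Promise<MemoryContext> {
    this.prefetch(userId);
    return this.pending.get(userId) as Promise<MemoryContext>;
  }
}
```

If the user takes even a second to click Start, `get` resolves immediately because the promise has already settled. A production version would also want to evict failed or stale entries, which this sketch leaves out.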
The CLM (Custom Language Model) Pattern
Hume EVI handles voice streaming and emotion detection, but it doesn't know your life story. The CLM endpoint bridges that gap.
When you speak, Hume transcribes your audio and detects emotions from your vocal prosody. Then it calls our CLM endpoint with the transcript and emotion scores. The endpoint runs 6 parallel queries (profile, vector search, recent chunks, active facts, last conversation timestamp, timezone), builds a system prompt with all that context plus the emotion data, and streams Claude's response back to Hume. Hume converts Claude's text to speech and plays it back.
Every single turn goes through this loop. The AI rebuilds its full context on every message so it never goes stale mid-conversation.
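As a sketch of what one turn inside the CLM endpoint might look like: fire all six lookups at once with `Promise.all`, then assemble the prompt. The `ContextSources` interface, its method names, and the prompt section labels are all assumptions; the actual endpoint and prompt format are not shown in this README.

```typescript
interface TurnContext {
  profile: string;
  searchHits: string[];
  recentChunks: string[];
  activeFacts: string[];
  lastConversationAt: Date | null;
  timezone: string;
}

// Hypothetical data-access layer; each method stands in for one query.
interface ContextSources {
  profile(userId: string): Promise<string>;
  vectorSearch(userId: string, query: string): Promise<string[]>;
  recentChunks(userId: string): Promise<string[]>;
  activeFacts(userId: string): Promise<string[]>;
  lastConversationAt(userId: string): Promise<Date | null>;
  timezone(userId: string): Promise<string>;
}

// Run the six queries concurrently so the turn waits only for the slowest one.
async function gatherContext(
  src: ContextSources,
  userId: string,
  transcript: string
): Promise<TurnContext> {
  const [profile, searchHits, recentChunks, activeFacts, lastConversationAt, timezone] =
    await Promise.all([
      src.profile(userId),
      src.vectorSearch(userId, transcript),
      src.recentChunks(userId),
      src.activeFacts(userId),
      src.lastConversationAt(userId),
      src.timezone(userId),
    ]);
  return { profile, searchHits, recentChunks, activeFacts, lastConversationAt, timezone };
}

// Assemble the system prompt; section labels are illustrative only.
function buildSystemPrompt(ctx: TurnContext, emotionLine: string): string {
  const last = ctx.lastConversationAt ? ctx.lastConversationAt.toISOString() : "never";
  return [
    `PROFILE:\n${ctx.profile}`,
    `FACTS:\n${ctx.activeFacts.join("\n")}`,
    `RECENT:\n${ctx.recentChunks.join("\n")}`,
    `RELATED HISTORY:\n${ctx.searchHits.join("\n")}`,
    `LAST CONVERSATION: ${last}`,
    `USER TIMEZONE: ${ctx.timezone}`,
    emotionLine,
  ].join("\n\n");
}
```

The resulting string would be sent as the system prompt on the Claude request, with the response streamed back to Hume.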
Fact Extraction
After each conversation ends, facts are extracted synchronously in about 500ms using Claude Haiku. It reads the transcript and pulls out:
- Relationships: "Partner's name is Jamie"
- Life events: "Son Lucas born March 2025"
- Experiences: "Had sushi downtown with family"
- Emotions: "Stressed about upcoming interview"
- Intentions: "Wants to max out 401k this year"
- Casual mentions: "Wants to try the new ramen place on 6th Street"
Facts are categorized, deduplicated against existing facts (70% word overlap threshold), and older versions get superseded by newer ones. A garbage filter rejects meta-observations like "user tested whether AI remembers" or "AI responded with." Because that's not a fact about YOU. That's noise.
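Here is a small sketch of how the dedup and garbage filter could work. The README only gives the 70% threshold, so the exact overlap formula (shared words divided by the size of the smaller word set) and the regex patterns are assumptions.

```typescript
// Lowercased set of words in a fact.
function tokenize(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9']+/g) ?? []);
}

// Share of the smaller fact's words that also appear in the other fact (0..1).
function wordOverlap(a: string, b: string): number {
  const wordsA = tokenize(a);
  const wordsB = tokenize(b);
  if (wordsA.size === 0 || wordsB.size === 0) return 0;
  let shared = 0;
  wordsA.forEach((word) => {
    if (wordsB.has(word)) shared++;
  });
  return shared / Math.min(wordsA.size, wordsB.size);
}

// A candidate is a duplicate if it overlaps any existing fact at or above the threshold.
function isDuplicate(candidate: string, existing: string[], threshold = 0.7): boolean {
  return existing.some((fact) => wordOverlap(candidate, fact) >= threshold);
}

// Illustrative patterns for meta-observations about the conversation itself.
const META_PATTERNS: RegExp[] = [/\buser tested\b/i, /\bAI responded\b/i];

function isMetaObservation(fact: string): boolean {
  return META_PATTERNS.some((pattern) => pattern.test(fact));
}
```

With this formula, "Partner's name is Jamie Lee" counts as a duplicate of "Partner's name is Jamie", which is the point where superseding would kick in instead of storing both.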
Emotion Detection
Hume's prosody model analyzes vocal patterns and returns confidence scores for 48 emotions on every utterance. The top 3 are passed to Claude as context:
The user's voice shows: Sadness: 45%, Anxiety: 32%, Determination: 28%
The system prompt tells Claude to respond to the emotion without narrating it. If someone sounds sad but says they're fine, the AI might say "You say you're fine but I can hear it in your voice. What's really going on?" instead of "I detect that you're feeling sad." Because nobody talks like that.
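Turning raw scores into that context line is a sort-and-format job. A minimal sketch, assuming scores arrive as a name-to-fraction map (0 to 1); `topEmotions` and `emotionLine` are hypothetical helper names.

```typescript
// Emotion name mapped to confidence score (assumed to be 0..1).
type EmotionScores = Record<string, number>;

// Highest-scoring emotions first, limited to n entries.
function topEmotions(scores: EmotionScores, n = 3): [string, number][] {
  return Object.keys(scores)
    .map((name) => [name, scores[name]] as [string, number])
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

// Format the top emotions as a single line for the system prompt.
function emotionLine(scores: EmotionScores): string {
  const parts = topEmotions(scores).map(
    ([name, score]) => `${name}: ${Math.round(score * 100)}%`
  );
  return `The user's voice shows: ${parts.join(", ")}`;
}
```

The instruction not to narrate emotions lives in the system prompt itself; this line only supplies the signal.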
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/hume/clm/chat/completions | POST | CLM endpoint (called by Hume, not browser) |
| /api/hume/token | GET | Get Hume access token for WebSocket |
| /api/journal | GET/POST | List entries / Save completed session |
| /api/journal/autosave | POST | Live transcript save (every 5s) |
| /api/journal/insights | POST | Ask questions about your history |
| /api/journal/emotions | GET | Emotion trend data |
| /api/memory/profile | GET | Cached profile + facts + recent context |
| /api/user/timezone | POST | Sync browser timezone to DB |
File Structure
speak2me/
├── app/
│   ├── layout.tsx                      # Root layout with auth, timezone sync
│   ├── page.tsx                        # Dashboard (streaks, emotions, memory)
│   ├── journal/page.tsx                # Voice session page
│   ├── history/page.tsx                # Conversation history by date
│   ├── entry/[id]/page.tsx             # Individual entry detail
│   ├── settings/page.tsx               # User settings
│   └── api/
│       ├── hume/
│       │   ├── clm/chat/completions/   # Custom Language Model endpoint
│       │   └── token/                  # Hume access token
│       ├── journal/
│       │   ├── route.ts                # Save/list entries + fact extraction
│       │   ├── autosave/               # Live transcript auto-save
│       │   ├── insights/               # Natural language history queries
│       │   ├── emotions/               # Emotion trend data
│       │   └── stats/                  # Usage statistics
│       ├── memory/
│       │   └── profile/                # Cached profile + facts + context
│       └── user/
│           └── timezone/               # Browser timezone sync
├── components/
│   ├── voice/
│   │   └── voice-session.tsx           # Main voice UI
│   ├── dashboard/
│   │   ├── memory-callback.tsx         # Smart memory carousel
│   │   ├── emotion-trends.tsx          # Weekly mood chart
│   │   └── streak-display.tsx          # Streak counter
│   ├── timezone-sync.tsx               # Auto-syncs browser TZ to DB
│   └── ...
├── lib/
│   ├── prompts.ts                      # System prompt builder + fact selection
│   ├── memory.ts                       # Profile synthesis
│   ├── facts.ts                        # Fact storage, retrieval, dedup
│   ├── embeddings.ts                   # OpenAI embedding generation
│   ├── db.ts                           # TiDB connection
│   └── inngest/
│       └── functions.ts                # Background jobs (profile, cache, chunks)
└── scripts/
    ├── schema.sql                      # Full database schema
    ├── migrate-facts.ts                # Fact migration tool
    └── rebuild-cache.ts                # Cache rebuilder
Privacy
All data is per-user and authenticated via Google OAuth. Transcripts and AI responses are stored in TiDB (encryption planned). Emotion data is stored alongside transcripts and never shared. There are no third-party memory APIs involved. Voice audio is processed by Hume in real time and not stored.
Your data stays yours. That was the whole point of building this.
Known Limitations
- Single user right now (multi-user auth in progress)
- Hume's turn detection is a black box with no configurable thresholds
- Emotion badges removed from live UI to reduce the "talking to a machine" feeling
- Voice enrollment for speaker filtering exists but isn't wired into live sessions yet
Code Access
The repo is private. If you're a recruiter or hiring manager and want to walk through the code or architecture, I'm happy to do a live session. Just reach out.
Try it: speak2me.io