Voice & Communication

SpeakNode uses WebRTC for real-time voice communication between users and AI agents.

How It Works

  1. A LiveKit room is created when an agent is dispatched
  2. The AI agent (Python worker) joins the room and starts listening
  3. The user connects via browser (widget or client) or phone
  4. Audio streams in both directions in real time
  5. The session is recorded automatically (audio egress)
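The five steps above can be sketched as a small concurrent flow. This is an illustrative stub only: `dispatch_session` and `join` are hypothetical names, not the SpeakNode or LiveKit API, and the real connect handshakes are replaced with no-ops.

```python
import asyncio
import uuid

async def join(role: str, room: str) -> str:
    """Stand-in for a participant's real connect handshake."""
    await asyncio.sleep(0)
    return f"{role}@{room}"

async def dispatch_session() -> dict:
    room = f"room-{uuid.uuid4().hex[:8]}"             # 1. create the LiveKit room
    agent = asyncio.create_task(join("agent", room))  # 2. agent worker joins and listens
    user = asyncio.create_task(join("user", room))    # 3. user connects (browser or phone)
    await asyncio.gather(agent, user)                 # 4. audio now flows both ways
    return {"room": room, "recording": True}          # 5. egress recording is on
```

In the real system the agent and user join independently over WebRTC; the `gather` here just models that both must be present before the conversation is live.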

Voice Pipeline

User speaks → WebRTC audio → STT (speech-to-text) → 
LLM processes text → generates response → 
TTS (text-to-speech) → WebRTC audio → User hears
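One turn of the pipeline above can be expressed as a function composition. The stub providers here are placeholders for the real STT/LLM/TTS integrations:

```python
from typing import Callable

def voice_pipeline(audio_in: bytes,
                   stt: Callable[[bytes], str],
                   llm: Callable[[str], str],
                   tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: user audio in, agent audio out."""
    text = stt(audio_in)    # speech-to-text
    reply = llm(text)       # LLM generates a response
    return tts(reply)       # text-to-speech back to the user

# Stub providers for illustration only:
stub_stt = lambda audio: "hello"
stub_llm = lambda text: f"You said: {text}"
stub_tts = lambda text: text.encode("utf-8")
```

Swapping a provider (say, a different TTS vendor) means substituting one callable without touching the rest of the pipeline.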

Speech-to-Text (STT)

Converts user speech to text for the LLM. Supported providers:

  • Azure Speech Services
  • Additional providers, enabled via configuration
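Streaming STT services, including Azure Speech, typically emit interim hypotheses that are later replaced by a final transcript; only the final text should be handed to the LLM. A minimal sketch of that aggregation (the class and method names are illustrative, not a real SDK):

```python
class TranscriptAggregator:
    """Collects interim STT results and commits text only on final results."""

    def __init__(self):
        self.committed: list[str] = []
        self.interim: str = ""

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            self.committed.append(text)  # final hypothesis: commit it
            self.interim = ""
        else:
            self.interim = text          # interim hypothesis: may still change

    def transcript(self) -> str:
        return " ".join(self.committed)
```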

Large Language Model (LLM)

Processes the conversation and generates responses. Supported providers:

  • OpenAI (GPT-4, etc.)
  • OpenRouter (access to multiple models)
  • Cloudflare AI
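All three providers accept the OpenAI-style chat message format (a list of role/content dicts), which is why they are interchangeable here. A sketch of assembling one request payload, assuming that format:

```python
def build_messages(system_prompt: str,
                   history: list[tuple[str, str]],
                   user_text: str) -> list[dict]:
    """Assemble a chat payload: system prompt, prior turns, new user input."""
    messages = [{"role": "system", "content": system_prompt}]
    for role, content in history:
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": user_text})
    return messages
```

The resulting list would be passed to whichever chat-completions endpoint is configured (OpenAI, OpenRouter, or Cloudflare AI).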

Text-to-Speech (TTS)

Converts agent responses to speech. Supported providers:

  • ElevenLabs
  • Azure Speech Services
  • Minimax
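A common latency optimization for voice agents is to split the LLM response into sentences so TTS can start synthesizing the first one while the rest is still being generated. A simple sketch of that chunking (the splitting rule is illustrative; production systems handle abbreviations and numbers more carefully):

```python
import re

def sentence_chunks(text: str) -> list[str]:
    """Split a response at sentence-ending punctuation for incremental TTS."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```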

Voice Activity Detection (VAD)

Detects when the user starts and stops speaking. Controls turn-taking behavior — when the agent should start or stop talking.
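The idea can be illustrated with a toy energy-based detector plus a "hangover" counter, so short pauses do not cut the user off mid-sentence. Real VADs (typically model-based) are far more robust; this sketch only shows the turn-taking logic:

```python
import math

def is_speech(frame: list[float], threshold: float = 0.01) -> bool:
    """A frame counts as speech when its RMS energy exceeds a threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

class TurnDetector:
    """Ends the user's turn after `hangover` consecutive silent frames."""

    def __init__(self, hangover: int = 15):
        self.hangover = hangover
        self.silent = 0
        self.speaking = False

    def feed(self, speech: bool) -> bool:
        """Feed one frame's VAD result; returns True when the turn just ended."""
        if speech:
            self.speaking = True
            self.silent = 0
            return False
        if self.speaking:
            self.silent += 1
            if self.silent >= self.hangover:
                self.speaking = False
                self.silent = 0
                return True
        return False
```

When `feed` returns True, the agent knows it may start talking.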

Audio Recording

Every conversation is automatically recorded via LiveKit Egress. Recordings include:

  • A composite recording of the full session audio
  • Individual per-participant audio tracks

Recordings are stored in S3-compatible object storage.

Real-Time Notifications

The platform uses SignalR to notify the frontend about session events:

  • Agent Ready — agent has joined the room
  • Session Completed — conversation ended, audio available
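On the frontend side these amount to named events with payloads. A tiny publish/subscribe sketch mirroring the two events above (the event names match the list; the class itself is illustrative, not the SignalR client):

```python
from collections import defaultdict
from typing import Callable

class SessionEvents:
    """Minimal event bus: register handlers per event name, then emit payloads."""

    def __init__(self):
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def on(self, event: str, handler: Callable) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: dict) -> None:
        for handler in self._handlers[event]:
            handler(payload)
```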

Session Lifecycle

  • Pending — Session created, room not yet ready
  • Dispatching — LiveKit room created, waiting for the agent to join
  • Active — Agent joined, conversation in progress
  • Completed — Conversation ended normally
  • Failed — An error occurred during the session
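The lifecycle can be modeled as a small state machine. The exact transition set below is an assumption inferred from the status descriptions, not a documented contract:

```python
# Assumed legal transitions between session statuses.
TRANSITIONS = {
    "Pending":     {"Dispatching", "Failed"},
    "Dispatching": {"Active", "Failed"},
    "Active":      {"Completed", "Failed"},
    "Completed":   set(),   # terminal
    "Failed":      set(),   # terminal
}

def advance(current: str, target: str) -> str:
    """Move a session to `target`, rejecting illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```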