VeriWire
A proof-of-concept real-time voice workflow for bank wire confirmation
I designed and built the end-to-end voice workflow: Twilio Media Streams into a WebSocket bridge, Deepgram’s real-time voice agent for speech-to-text, text-to-speech, and GPT-4o-mini-driven conversation, a FastAPI bank sandbox, an explicit LangGraph decision model, and SQLite persistence for every call event. The system stays inside a simulated bank environment and was tested through live mobile calls using an active Twilio number during the project period.
Designed and built the complete proof of concept solo, including the bank sandbox API, LangGraph decision flow, voice bridge, verification logic, event storage, and automated tests.
Overview
VeriWire simulates a bank calling a customer to verify a suspicious wire transfer, entirely inside a sandboxed bank environment. Twilio Media Streams sends call audio through a WebSocket bridge to Deepgram’s voice agent, which handles speech-to-text, text-to-speech, and turn-by-turn conversation through GPT-4o-mini. Before reaching a final approve, cancel, or escalate decision, the agent runs a liveness challenge, a two-factor identity check against the last four digits of the card on file and a spoken phone number, a simulated deepfake-risk signal, and a grounded transaction lookup through tool calls into a FastAPI bank sandbox. A parallel LangGraph state machine models the same verification and decision flow in explicit, testable Python code.
Architecture
Twilio Media Streams carries telephony audio (mu-law, 8 kHz) into a Python WebSocket bridge, which relays it to Deepgram’s real-time voice agent over a second WebSocket connection. Deepgram’s nova-3 model handles speech-to-text and its aura-2 model handles text-to-speech, while GPT-4o-mini, configured as the agent’s reasoning provider, drives the turn-by-turn conversation and decides when to call a tool. The agent’s policy runs each call through five stages: a liveness challenge with a randomly generated phrase; a two-factor identity check against the last four digits of the card and a spoken phone number, with up to two reprompts per factor before escalating; a simulated deepfake-risk score; a grounded transaction read-back; and a final approve, cancel, or escalate decision. Every payment lookup and status change goes through a FastAPI bank sandbox. A parallel LangGraph state machine in veriwire/graph.py implements the same liveness, deepfake-check, identity, and decision flow as an explicit, independently testable graph. Each call is keyed by its Twilio stream ID in an in-memory session store, and every event is written to a SQLite-backed event log.
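To make the first hop concrete, here is a minimal sketch of how a bridge might unpack one Twilio Media Streams message before relaying the audio onward. The function name parse_twilio_frame is hypothetical and is not taken from the project. The message shape (an "event" field, a top-level "streamSid", and a base64 "payload" under "media") follows Twilio's documented Media Streams format.

```python
import base64
import json
from typing import Optional, Tuple


def parse_twilio_frame(raw: str) -> Tuple[str, Optional[str], bytes]:
    """Return (event, stream_sid, audio_bytes) for one Media Streams message.

    Hypothetical helper for illustration; not the project's actual bridge code.
    """
    msg = json.loads(raw)
    event = msg.get("event", "")

    if event == "start":
        # The start message carries call metadata, including the stream ID.
        return event, msg["start"]["streamSid"], b""

    if event == "media":
        # Audio arrives as base64-encoded mu-law bytes at 8 kHz.
        audio = base64.b64decode(msg["media"]["payload"])
        return event, msg.get("streamSid"), audio

    # Other events (connected, stop, mark, dtmf) carry no audio payload.
    return event, msg.get("streamSid"), b""
```

A bridge loop would typically call this on each incoming text frame, forward non-empty audio bytes to the second WebSocket, and use the stream ID as the session key.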
What I Built
I built the FastAPI bank sandbox, the WebSocket bridge between Twilio and Deepgram’s voice agent, the LangGraph decision model, the identity, liveness, and deepfake-risk checks, the tool clients, the SQLite persistence layer, and the pytest suite.
- Configured Deepgram’s voice agent with GPT-4o-mini as its reasoning provider, nova-3 for speech-to-text, and aura-2 for text-to-speech, then bridged it to Twilio Media Streams over a Python WebSocket server.
- Wrote the agent’s verification policy to check the card last four and a spoken phone number, reprompting up to two times per factor before escalating instead of allowing unlimited retries.
- Required transaction details to come from the bank sandbox through a tool call so the agent could not invent payment information from conversation context, and made approve and cancel idempotent with explicit 404 and 409 responses.
- Built a parallel LangGraph state machine that models the same liveness, deepfake-check, identity, and decision flow as explicit, independently testable Python code.
- Kept session state isolated per call, keyed by the Twilio stream ID, and wrote every call event into a SQLAlchemy-backed SQLite log.
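The event log described in the last bullet can be sketched roughly as follows. This is an illustration only: it uses the standard-library sqlite3 module instead of SQLAlchemy to stay self-contained, and the table name, column names, and function names are all assumptions, not the project's schema.

```python
import json
import sqlite3
import time
from typing import Any, Dict, List


def open_event_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a call-event log. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS call_events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "stream_id TEXT NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "created_at REAL NOT NULL)"
    )
    return conn


def log_event(conn: sqlite3.Connection, stream_id: str,
              event_type: str, payload: Dict[str, Any]) -> None:
    """Append one event for a call, keyed by its Twilio stream ID."""
    conn.execute(
        "INSERT INTO call_events (stream_id, event_type, payload, created_at) "
        "VALUES (?, ?, ?, ?)",
        (stream_id, event_type, json.dumps(payload), time.time()),
    )
    conn.commit()


def events_for(conn: sqlite3.Connection, stream_id: str) -> List[Dict[str, Any]]:
    """Return one call's events in insertion order."""
    rows = conn.execute(
        "SELECT event_type, payload FROM call_events "
        "WHERE stream_id = ? ORDER BY id",
        (stream_id,),
    ).fetchall()
    return [{"event_type": t, "payload": json.loads(p)} for t, p in rows]
```

Filtering every read by stream ID keeps one call's audit trail separate from another's, which matches the per-call isolation described above.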
Engineering Decisions
Deepgram’s voice agent over a custom STT/TTS pipeline
Why — A single Deepgram Agent connection bundled speech-to-text, text-to-speech, turn-taking, and function-calling over one WebSocket instead of wiring three separate services together.
Trade-off — The agent’s conversational policy lives in a natural-language system prompt rather than deterministic code, so the live call flow is shaped by Deepgram’s agent rather than executed step-by-step by the LangGraph graph.
A parallel LangGraph state machine for the same decision logic
Why — Modeling liveness, the deepfake check, identity verification, and the approve/cancel decision as an explicit graph made that logic independently testable outside of a live call.
Trade-off — The graph is not wired into the live Twilio/Deepgram call path, so it documents and tests the decision logic rather than executing it during an actual call.
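As a rough illustration of what "an explicit graph" means here, the sketch below models the same stages as plain Python functions that each return the name of the next node. It deliberately does not use the LangGraph API, and it is not the contents of veriwire/graph.py. The state keys and node names are assumptions; the 0.78 threshold is the one this write-up reports.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

RISK_THRESHOLD = 0.78  # threshold stated in the write-up
TERMINAL = {"approve", "cancel", "escalate"}


def liveness(state: State) -> str:
    # A failed liveness challenge goes straight to escalation.
    return "deepfake_check" if state.get("liveness_passed") else "escalate"


def deepfake_check(state: State) -> str:
    # A risk score above the threshold blocks automatic approval.
    risk = float(state.get("risk_score", 0.0))
    return "escalate" if risk > RISK_THRESHOLD else "identity"


def identity(state: State) -> str:
    # Both factors must pass: card last four and spoken phone number.
    ok = bool(state.get("card_ok")) and bool(state.get("phone_ok"))
    return "decide" if ok else "escalate"


def decide(state: State) -> str:
    # Only an explicit approve or cancel intent is honoured.
    intent = state.get("intent")
    return intent if intent in ("approve", "cancel") else "escalate"


NODES: Dict[str, Callable[[State], str]] = {
    "liveness": liveness,
    "deepfake_check": deepfake_check,
    "identity": identity,
    "decide": decide,
}


def run_flow(state: State) -> str:
    """Walk the nodes from 'liveness' until a terminal decision is reached."""
    node = "liveness"
    while node not in TERMINAL:
        node = NODES[node](state)
    return node
```

Because each node is a pure function of the state, a test can feed in a dictionary and assert on the final decision without placing a call.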
Two-factor identity verification with bounded reprompts
Why — The agent verifies the card last four and a spoken phone number against the bank sandbox, reprompting up to two times per factor before escalating, so a caller could not brute-force either check.
Trade-off — The reprompt limit is enforced through the agent’s system-prompt policy rather than a hard-coded counter, so it depends on the underlying model following that instruction.
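For contrast, a hard-coded counter of the kind this trade-off mentions could look like the sketch below. It is not part of the current build, where the limit lives in the system prompt, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

MAX_REPROMPTS = 2  # "up to two times per factor", per the policy above


@dataclass
class RepromptBudget:
    """Tracks failed attempts per identity factor for a single call."""

    failures: Dict[str, int] = field(default_factory=dict)

    def record_failure(self, factor: str) -> str:
        """Return 'reprompt' while budget remains, otherwise 'escalate'."""
        self.failures[factor] = self.failures.get(factor, 0) + 1
        # Failures 1 and 2 each earn a reprompt; the third failure escalates.
        if self.failures[factor] <= MAX_REPROMPTS:
            return "reprompt"
        return "escalate"
```

With this shape, the initial attempt plus two reprompts gives three tries per factor, and each factor keeps its own count.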
Tool calling for transaction lookup
Why — Fetching transaction details through a tool call kept the agent grounded in the bank sandbox rather than allowing it to construct payment details from memory.
DTMF as a documented override path
Why — The design called for DTMF keypresses to take priority over spoken intent, for example pressing 2 while saying "approve" resolves to cancel, giving callers a deterministic fallback when speech recognition was uncertain.
Trade-off — The direct DTMF-to-agent message path is currently disabled in the WebSocket bridge after a compatibility issue with Deepgram’s Agent API client-message format, so this fallback is documented and scaffolded but not active in the current build.
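The intended precedence rule is small enough to sketch directly. The mapping of "2" to cancel comes from the example above; the mapping of "1" to approve and the function name are assumptions made for illustration. This reflects the documented design, not an active code path.

```python
from typing import Optional

# "2" -> cancel is from the design note; "1" -> approve is an assumed mapping.
DTMF_INTENTS = {"1": "approve", "2": "cancel"}


def resolve_intent(dtmf_digit: Optional[str],
                   spoken_intent: Optional[str]) -> Optional[str]:
    """Prefer a recognised keypress over whatever the caller said."""
    if dtmf_digit in DTMF_INTENTS:
        return DTMF_INTENTS[dtmf_digit]
    # No keypress, or an unmapped digit: fall back to the spoken intent.
    return spoken_intent
```

Under these assumptions, pressing 2 while saying "approve" resolves to cancel, and a call with no keypress keeps the spoken intent.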
Per-call session isolation
Why — Keying session state by the Twilio stream ID prevented one caller’s verification path from leaking into another concurrent session.
Trade-off — The design required explicit session lifecycle management instead of relying on shared conversational state.
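A minimal sketch of such a store, assuming an in-memory dictionary keyed by the Twilio stream ID. The class names and session fields are hypothetical and only illustrate the explicit start and end lifecycle.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class CallSession:
    """Verification state for one call. Fields are illustrative."""

    stream_id: str
    verified_factors: Set[str] = field(default_factory=set)
    payment_id: Optional[str] = None


class SessionStore:
    """In-memory sessions keyed by Twilio stream ID."""

    def __init__(self) -> None:
        self._sessions: Dict[str, CallSession] = {}

    def start(self, stream_id: str) -> CallSession:
        # Create the session on the stream's start event, at most once.
        if stream_id not in self._sessions:
            self._sessions[stream_id] = CallSession(stream_id)
        return self._sessions[stream_id]

    def get(self, stream_id: str) -> Optional[CallSession]:
        return self._sessions.get(stream_id)

    def end(self, stream_id: str) -> None:
        # Drop the session on the stop event so state cannot outlive the call.
        self._sessions.pop(stream_id, None)
```

Each session owns its own set of verified factors, so progress made on one call never shows up in a concurrent one.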
Results & Validation
Completed a working proof-of-concept flow that connected voice input, transaction lookup, two-factor identity verification, liveness checks, decision handling, and persistent event logging. The project was tested through live mobile calls using an active Twilio number during the project period.
Five pytest files validated the bank data layer’s idempotent approve and cancel behavior, the FastAPI sandbox’s HTTP status codes, tool-client ID normalization, and the SQLite event log. The live Twilio/Deepgram call path was validated through manual live calls rather than automated integration tests, and the LangGraph decision flow does not yet have automated graph-level tests. The project also documented an evaluation plan covering task-success rate, time-to-decision, and identity-verification pass rates for a future run.
Made approve and cancel idempotent and returned explicit HTTP 404 and 409 responses. The agent checked payment status before asking for identity, so a repeated request on an already-decided payment short-circuited to "already {status}" instead of re-running verification. A simulated risk spike, a randomly generated score above the fixed 0.78 threshold, triggered a freeze-and-escalate path instead of allowing automatic approval.
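One plausible reading of the idempotency behaviour is sketched below. The write-up confirms only that 404 and 409 are returned and that repeats short-circuit to "already {status}". Treating 404 as "unknown payment" and 409 as "already decided the other way" is an assumption, as are the function name, status strings, and payment IDs.

```python
from typing import Dict, Tuple

ACTION_TO_STATUS = {"approve": "approved", "cancel": "cancelled"}


def decide_payment(payments: Dict[str, str], payment_id: str,
                   action: str) -> Tuple[int, Dict[str, str]]:
    """Apply approve or cancel; return (http_status, body).

    Assumed semantics for illustration, not the project's actual endpoint.
    """
    target = ACTION_TO_STATUS[action]
    status = payments.get(payment_id)

    if status is None:
        return 404, {"detail": "payment not found"}
    if status == target:
        # Repeating the same decision is a no-op, which makes it idempotent.
        return 200, {"status": status}
    if status != "pending":
        # The payment was already decided the other way.
        return 409, {"detail": f"already {status}"}

    payments[payment_id] = target
    return 200, {"status": target}
```

Under these assumptions, approving the same payment twice returns the same success response both times, while cancelling an approved payment is rejected rather than silently overwriting the earlier decision.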
Ran the proof of concept in a simulated bank environment during the project period, with a separate FastAPI sandbox and a WebSocket voice bridge, and tested it through live mobile calls using an active Twilio number.