Build a voice research agent with Render Workflows and AssemblyAI
Voice interfaces break when background work is slow. Learn how to separate your voice channel from LLM orchestration using AssemblyAI, Render Workflows, and Mastra.


Building voice interfaces for complex tasks is hard. Speech input and output need to feel instant, but the work happening in the background — LLM calls, multi-stage search, synthesis — can take real time. If you block the voice channel waiting on tool execution, you end up with awkward pauses and brittle sessions.
This tutorial walks through a reference architecture that solves this problem by separating the voice channel from background orchestration. You'll use AssemblyAI's Voice Agent, Render Workflows, Mastra agents, and You.com search to build a voice research experience that reliably completes in under 60 seconds.
All the code for this project is available on GitHub, or you can try it here.
Why not just let the LLM handle everything?
The typical approach to voice-first apps looks like this:
- Stream audio from the browser to a voice API
- The voice API calls an LLM
- The LLM runs tool calls (search, extract, synthesize) directly
This works fine for simple commands like "what time is it?" But for anything heavier, it falls apart quickly. A single LLM turn blocked by multiple search calls stalls the voice channel. One failed search can drop the whole session. And you get no durable execution record, no retry logic, no way to replay a failed stage.
The fix is to separate the voice channel from the orchestration. The voice agent stays in its own lightweight task. Everything else — classification, planning, search, synthesis, verification — runs as a chain of discrete, retry-able workflow tasks with their own compute, timeouts, and logs.
Architecture overview
The demo runs two services on Render:
- Web service — manages sessions, connects browser WebSockets to workflow tasks via a reverse-WS broker, and streams real-time progress via Server-Sent Events (SSE).
- Workflow service — executes a chain of isolated tasks: `voice_session`, `classify_ask`, `plan_queries`, `search_branch`, `synthesize`, and `verify`.
When a user opens the mic, the Web service dispatches a voice_session task. That task opens an AssemblyAI WebSocket, tunnels audio to and from the browser, and — when the user asks a research question — triggers the rest of the research pipeline.
Orchestrating with Render Workflows
The Web service uses the Render SDK to kick off the root task:
```typescript
export async function startVoiceSession(
  sessionId: string,
  token: string,
  publicWebUrl: string,
): Promise<string> {
  const { taskRunId } = await render.workflows.startTask(
    `${WORKFLOW_SLUG}/voice_session`,
    [{ sessionId, token, publicWebUrl }],
  );
  return taskRunId;
}
```

Inside `voice_session`, each subsequent stage is defined as its own workflow task with a dedicated compute plan, timeout, and retry policy:
```typescript
export const classifyAsk = task(
  {
    name: "classify_ask",
    plan: "starter",
    timeoutSeconds: 30,
    retry: { maxRetries: 1, waitDurationMs: 2000, backoffScaling: 2 },
  },
  async function classifyAsk(sessionId: string, topic: string) { /* ... */ },
);
```

The classifier runs on a small instance with a tight timeout. The synthesizer gets a larger instance with more time. Each stage fails independently — with its own logs, latency metrics, and replay options in the Render dashboard. No custom queuing infrastructure required.
The reverse WebSocket tunnel
The voice_session task opens two WebSocket connections:
- One to AssemblyAI for the voice agent
- One back to the Web service broker
The broker pairs the browser's WebSocket with the task's WebSocket using the session ID and token, creating a two-way audio tunnel:
```
browser ↔ broker ↔ workflow task ↔ AssemblyAI
```

When AssemblyAI emits a `tool.call`, the task handles it directly — no extra round-trip to the browser. For research requests, the task launches the research chain, waits for a briefing, then feeds the result back to AssemblyAI for voice synthesis.
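In code, that dispatch can live right on the task's AssemblyAI socket. The sketch below is a rough shape rather than the repo's exact code: the message payload is simplified, and `runResearchChain` stands in for the task's entry point into the research pipeline.

```typescript
import WebSocket from "ws";

// Stand-in for the workflow task's research pipeline entry point.
declare function runResearchChain(sessionId: string, topic: string): Promise<string>;

// Sketch: dispatch on messages from the AssemblyAI socket. The payload shape
// here is simplified; consult the AssemblyAI Voice Agent docs for the real one.
function handleAgentMessages(agentSocket: WebSocket, sessionId: string) {
  agentSocket.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "tool.call" && msg.name === "research") {
      // Launch the research chain and wait (deadline-bounded) for the briefing.
      const briefing = await runResearchChain(sessionId, msg.arguments.topic);
      // Feed the result back so the voice agent can speak it.
      agentSocket.send(
        JSON.stringify({ type: "tool.result", id: msg.id, output: briefing }),
      );
    }
  });
}
```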
Shape-aware research with Mastra
Not all questions are the same. "Tell me about bonobos" needs prose with citations. "List every tribe in the Bible" needs a complete enumeration. Running a fixed-strategy pipeline for both is wasteful.
The first stage uses a lightweight Mastra agent to classify the user's question into one of five shapes: narrative, enumeration, comparison, specific, or recent.
```typescript
const result = await classifierAgent.generate([
  { role: "user", content: `Classify this ask: ${topic}` },
]);
const { shape } = JSON.parse(result.text);
```

The planner uses the shape to determine how many search queries to run — up to 40 for enumeration, fewer for narrative. The synthesizer then uses a shape-specific system prompt: enumeration gets "return a complete list with one item per line", narrative gets "weave sources into prose with inline citations".
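One way to encode that mapping is a plain lookup table. In the sketch below, only the enumeration ceiling of 40 and the two quoted prompts come from this article; the other counts and prompts are illustrative.

```typescript
// Sketch: shape-aware planning as a lookup table. Only the enumeration cap
// and the two quoted prompts are from the article; the rest is illustrative.
type AskShape = "narrative" | "enumeration" | "comparison" | "specific" | "recent";

const PLAN_BY_SHAPE: Record<AskShape, { maxQueries: number; synthPrompt: string }> = {
  enumeration: {
    maxQueries: 40,
    synthPrompt: "Return a complete list with one item per line.",
  },
  narrative: {
    maxQueries: 8,
    synthPrompt: "Weave sources into prose with inline citations.",
  },
  comparison: { maxQueries: 12, synthPrompt: "Contrast the candidates point by point." },
  specific: { maxQueries: 4, synthPrompt: "Answer directly and cite the single best source." },
  recent: { maxQueries: 6, synthPrompt: "Prioritize the newest sources and date each claim." },
};
```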
Enforcing a hard 60-second deadline
Long wait times kill demos. The research chain uses a racePartial helper that collects search results until an AbortSignal.timeout fires, then proceeds with whatever it has:
```typescript
const results = await racePartial(
  queries.map((q) => searchBranch(sessionId, q)),
  Math.max(0, deadline - Date.now() - 12_000),
);
```

If 7 out of 12 search branches complete before the deadline, the synthesizer proceeds with those 7 sources. Waiting for the stragglers would tank the user experience. A slightly incomplete briefing in 60 seconds beats a perfect briefing that never arrives.
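The helper itself isn't shown in the excerpt above. A minimal version, assuming the `(promises, timeoutMs)` signature implied by the call site, could look like this:

```typescript
// Sketch of racePartial: collect whatever has settled by the deadline.
// Failed branches simply contribute nothing.
export async function racePartial<T>(
  promises: Promise<T>[],
  timeoutMs: number,
): Promise<T[]> {
  const results: T[] = [];
  const settled = promises.map((p) =>
    p.then(
      (value) => { results.push(value); },
      () => { /* a failed search branch yields no sources */ },
    ),
  );
  const deadline = AbortSignal.timeout(timeoutMs);
  await Promise.race([
    Promise.all(settled),
    new Promise<void>((resolve) =>
      deadline.addEventListener("abort", () => resolve(), { once: true }),
    ),
  ]);
  return results; // whatever landed in time
}
```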
Verification and retry
Before the briefing reaches text-to-speech, a verifier agent checks it against the original request:
```typescript
const { passes, feedback } = await verify({ topic, shape, briefing });
if (!passes) {
  const retryQueries = await planQueries({ topic, shape, feedback });
  // one more search + synth pass, deadline permitting
}
```

The system allows one retry. If the verifier still isn't satisfied, the briefing ships as a degraded answer. A less-than-perfect response in 60 seconds is better than silence.
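The `verify` call can follow the same generate-then-parse pattern as the classifier. A minimal sketch, assuming a `verifierAgent` defined like the classifier agent and a simple JSON reply contract:

```typescript
// Assumed: a Mastra agent configured like the classifier agent.
declare const verifierAgent: {
  generate(msgs: { role: string; content: string }[]): Promise<{ text: string }>;
};

// Sketch: the verifier mirrors the classifier's generate-then-parse pattern.
// The prompt and JSON contract are assumptions, not the repo's exact code.
async function verify(input: { topic: string; shape: string; briefing: string }) {
  const result = await verifierAgent.generate([
    {
      role: "user",
      content:
        `Does this briefing fully answer the ask?\n` +
        `Ask: ${input.topic}\nShape: ${input.shape}\n` +
        `Briefing:\n${input.briefing}\n` +
        `Reply with JSON: {"passes": boolean, "feedback": string}`,
    },
  ]);
  return JSON.parse(result.text) as { passes: boolean; feedback: string };
}
```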
Real-time progress streaming
The Web service publishes phase events via SSE as tasks run:
```typescript
events.publish({ sessionId, kind: "ask.classified", shape, at: Date.now() });
events.publish({ sessionId, kind: "plan.ready", queries, at: Date.now() });
events.publish({ sessionId, kind: "youcom.call.started", tier, at: Date.now() });
```

The frontend subscribes to `/api/sessions/:id/events` and updates a live activity feed — no polling required.
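On the client, consuming that stream takes only a few lines with a plain `EventSource`. The element handling and formatting below are illustrative:

```typescript
// Sketch: subscribe to the session's SSE feed and append each phase event
// to a live activity list. Element handling here is illustrative.
function subscribeToProgress(sessionId: string, feed: HTMLUListElement) {
  const source = new EventSource(`/api/sessions/${sessionId}/events`);
  source.onmessage = (event) => {
    const phase = JSON.parse(event.data) as { kind: string; at: number };
    const item = document.createElement("li");
    item.textContent = `${new Date(phase.at).toLocaleTimeString()} ${phase.kind}`;
    feed.appendChild(item);
  };
  return () => source.close(); // call on teardown to release the connection
}
```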
Handling concurrent sessions
To keep the demo stable at scale, a few guardrails are in place:
- Concurrency cap — `POST /api/start` returns `503 AT_CAPACITY` when 100 active sessions are running
- Session TTL — each session expires 15 minutes after creation
- Cleanup daemon — a loop runs every 60 seconds, cancels expired workflow tasks via `render.workflows.cancelTaskRun()`, and marks sessions closed in Postgres (see the sketch after this list)
- Token-based routing — `/` redirects to `/s/{token}`, so a browser reload resumes the existing session instead of consuming new capacity
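The cleanup daemon is just an interval loop. In this sketch, the `db` helpers are hypothetical stand-ins for the Postgres queries; `render.workflows.cancelTaskRun()` is the SDK call named above.

```typescript
import { render } from "./render-client"; // the Render SDK client used earlier
import { db } from "./db";                // hypothetical Postgres query helpers

const SWEEP_INTERVAL_MS = 60_000;

async function sweepExpiredSessions() {
  // Hypothetical query: sessions whose 15-minute TTL has elapsed.
  const expired = await db.findSessionsExpiredBefore(Date.now());
  for (const session of expired) {
    try {
      await render.workflows.cancelTaskRun(session.taskRunId);
    } catch {
      // The task may have already finished; cancellation is best-effort.
    }
    await db.markSessionClosed(session.id);
  }
}

setInterval(() => void sweepExpiredSessions(), SWEEP_INTERVAL_MS);
```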
Deploying on Render
The repo includes a render.yaml Blueprint for the Web service and Postgres database. For the Workflow service, you'll create it separately:
- Use the Deploy to Render button in the GitHub README to provision the Web service and database
- Create a new Workflow service in the dashboard, pointing to the same repo with start command `node dist/render/tasks/index.js`
- Set the required environment variables on both services: `ASSEMBLYAI_API_KEY`, `ANTHROPIC_API_KEY`, `YOU_API_KEY`, and `RENDER_API_KEY`
The `preDeployCommand` runs `npm run migrate` automatically on every deploy.
What you can take from this
This architecture is worth adapting for any voice application that needs to do real work in the background. A few patterns that transfer well:
- Separate voice from orchestration. Keep the voice channel in its own task. Don't block it with tool execution.
- Use workflow tasks for each pipeline stage. Independent timeouts, retries, and logs make debugging much easier than a monolithic pipeline.
- Shape-aware planning improves output quality. Classifying the request before searching means your synthesizer gets the right prompting for the right kind of answer.
- Hard deadlines with partial results beat open-ended waiting. Users will forgive an incomplete answer. They won't forgive silence.
👉 Check out the full source on GitHub and the AssemblyAI Voice Agent docs to get started.