How to create an AI cold-calling agent
Build an AI cold-calling agent that books meetings, handles objections, and sounds human. Architecture, compliance, and the speech-to-text latency that makes or breaks it.



An AI cold-calling agent is an outbound Voice AI system that places calls, opens the conversation, pitches, handles objections, and either books a meeting or disqualifies the lead — without a human on the line. Built right, it runs 500 calls in parallel at roughly the cost of a single SDR.
Built wrong, it sounds like a telemarketer with a bad connection and gets hung up on in four seconds.
This guide walks through how to actually create one: the architecture, the speech-to-text accuracy you need for objection handling to work, the compliance traps (TCPA, state-level consent), and the pieces that decide whether your agent books meetings or ends up on a "do not call" list.
We'll anchor the stack on the AssemblyAI Universal-3 Pro Streaming model — 307ms P50 latency, native mulaw, and the alphanumeric accuracy that matters when the prospect rattles off their email.
What is an AI cold-calling agent?
An AI cold-calling agent is an outbound Voice AI system that dials a prospect, delivers a pitch in natural conversation, adapts in real time based on what the prospect says, and books qualified meetings or gathers disposition data. Unlike a robocall (one-way recorded message) or a dialer with a human rep, it conducts a two-way conversation autonomously.
The typical jobs an AI cold-calling agent does:
- Outbound SDR prospecting: open with a relevant hook, qualify BANT, book a demo
- Appointment setting for field sales, financial advisors, home services
- Re-engagement of lapsed leads in a CRM
- Survey and research calls at scale
- Event follow-up and RSVP confirmation
- Renewal and upsell motions for existing customers
The common thread: one script, thousands of conversations, measurable booking rate.
The architecture of an AI cold-calling agent
An AI cold-calling agent is a phone-based voice agent with a few extra components tuned for outbound. Here's the full stack:
<pre>
CRM / lead list (Salesforce, HubSpot, CSV)
│
▼
Dialer / orchestrator
(concurrency, pacing, DNC check, retries)
│
▼
Twilio / SIP outbound call
│ WebSocket bridge
▼
Universal-3 Pro Streaming (STT)
│ transcript
▼
LLM with sales prompt + objection map
│ text response + tool calls
▼
TTS (ElevenLabs / Cartesia)
│ audio
▼
Prospect
│
└─► Call disposition
└─► CRM update + calendar booking
</pre>The five components that matter:
- Lead source and dialer — where the list comes from and how you pace calls
- Telephony — Twilio, SIP, or a managed voice agent platform
- Streaming speech-to-text — the ears; must hear objections the moment they start
- LLM with a sales-specific prompt — opener, discovery, objection handling, booking logic
- Text-to-speech — the voice; naturalness matters more here than on inbound
Plus two things that are unique to outbound: compliance filtering (TCPA, state consent laws, DNC registries) and post-call disposition sync back to the CRM.
Why speech-to-text accuracy decides whether an AI cold-calling agent works
On an inbound support call, the caller wants help — they'll repeat themselves if you miss something. On an outbound cold call, the prospect is deciding whether to hang up in the first five seconds. If your agent mishears "not interested, take me off the list" as "I'm interested, tell me more," you don't get a second chance.
Three STT capabilities decide the quality of an AI cold-calling agent:
Low, stable latency
Natural turn-taking happens in under 800ms end-to-end. Any longer and the prospect thinks they lost connection — or worse, that they're on a robocall. The Universal-3 Pro Streaming model delivers 307ms median latency with immutable transcripts, which lets your LLM start generating a response before the prospect even finishes their sentence.
Alphanumeric accuracy
Cold calls capture emails, phone numbers, company names, and job titles. "J at acme dot io," "director of rev ops," "five one five, nine eight two, four zero zero zero." Universal-3 Pro Streaming delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation — the difference between a booked meeting in your calendar and a typo you never catch.
Intelligent endpointing
Prospects pause. "I'm… probably not the right person to talk to about this." If your agent jumps in at the first pause, it interrupts. If it uses a fixed silence timer, it feels robotic. Intelligent endpointing combines acoustic and semantic signals to detect real turn boundaries — the difference between a thoughtful agent and an impatient one.
Building the conversation logic
The LLM prompt is where an AI cold-calling agent earns its meetings or wastes the prospect's time. A good cold-calling prompt has four sections:
1. Identity and opener
Who the agent is, which company it represents, why it's calling. Must include clear AI disclosure in the opener — this is both good practice and legally required in several states (California, Florida, Texas among others).
2. Discovery questions
Two to four questions that qualify or disqualify the prospect. Don't ask five — you'll get hung up on.
3. Objection handling map
A structured map of likely objections and how to respond. The usual suspects:
- "How did you get my number?"
- "Send me an email instead."
- "I'm not the right person."
- "We already use [competitor]."
- "We're not interested."
- "Take me off your list."
That last one is the most important. If the prospect says anything that sounds like a do-not-call request, the agent must immediately:
- Acknowledge
- Confirm the number will be added to DNC
- End the call politely
- Flag the number in your CRM and DNC database
No upselling. No "can I just ask one question?" You don't get a second chance on a compliance complaint.
4. Booking logic
If the prospect qualifies and is interested, the agent needs to book — not hand off. That means live calendar access via tool call, a handful of proposed times, and confirmation sent over SMS or email during the call.
Picking the telephony layer
Three options depending on your volume and how much you want to operate yourself:
Whatever you pick, the audio path is the same: 8kHz mulaw in and out. Use a speech-to-text model that accepts mulaw natively — resampling to 16kHz PCM adds round-trip latency you can't afford on a cold call.
The outbound-specific components
Dialer and pacing
You can't just fire off 10,000 calls at once. Telco carriers flag high-volume outbound as spam within minutes, and your numbers get blocked. Real dialers pace calls, rotate outbound numbers, and respect time-of-day rules (TCPA restricts calls before 8am and after 9pm in the recipient's local time).
If you're using Twilio, you'll want a local presence strategy — matching the outbound caller ID to the area code of the number being dialed. Connection rates go up meaningfully.
Compliance filtering
Before any call goes out:
- Scrub against the federal Do Not Call registry
- Scrub against state DNC lists (several states maintain their own)
- Scrub against your internal suppression list (previous DNC requests, unsubscribes)
- Verify you have a valid purpose under TCPA for B2C calls, or a legitimate business interest for B2B
- For calls into EU numbers, confirm GDPR lawful basis
Build this filtering as a hard gate — no call goes out if any check fails. The fines for TCPA violations are $500–$1,500 per call.
Call recording and PII redaction
Record every call for quality and compliance. Store recordings encrypted. If you're recording in a two-party consent state (California, Florida, Pennsylvania, and others), the agent must get consent at the top of the call.
Use PII redaction on transcripts before they hit your CRM or analytics warehouse. Cold calls pick up personal data you often don't need to retain.
CRM sync and disposition
Every call ends with a disposition: booked, callback, not interested, DNC, voicemail, no answer, wrong number. That disposition has to land in the CRM within seconds, along with the transcript, recording URL, and any tool calls the agent made (calendar event IDs, follow-up email queued, etc.).
This is where most AI cold-calling agent projects leak value. Great calls, terrible data hygiene, nothing tracked, impossible to iterate on.
Minimal implementation sketch
Here's the shape of an AI cold-calling agent built on Twilio + AssemblyAI Universal-3 Pro Streaming + your LLM and TTS of choice. This is the outbound-specific piece — it assumes you already have the inbound WebSocket bridge from a standard phone-based voice agent tutorial.
<pre><code class="language-python">from twilio.rest import Client
import os
twilio = Client(
os.environ["TWILIO_SID"],
os.environ["TWILIO_AUTH"],
)
def place_cold_call(prospect):
# 1. Compliance gate — no call without a clean scrub
if is_on_dnc(prospect.phone) or is_suppressed(prospect.phone):
log_skipped(prospect, reason="dnc")
return
# 2. Pick a local-presence outbound number
from_number = pick_local_number(prospect.phone)
# 3. Open the call — TwiML handoff to our media stream handler
call = twilio.calls.create(
to=prospect.phone,
from_=from_number,
url=f"https://your-server.app/voice-agent/start?lead_id={prospect.id}",
record=True,
recording_status_callback="https://your-server.app/recording-done",
machine_detection="Enable", # detect voicemail, don't pitch a robot
time_limit=600, # cap at 10 min
)
return call.sid
</code></pre>
Two things worth calling out:
- machine_detection="Enable" — Twilio tells you when the call hit a voicemail. Your agent should either leave a short pre-recorded voicemail (compliant, clear AI disclosure) or hang up. Don't pitch a recording machine.
- time_limit=600 — cap call duration. Runaway LLM loops on a long call are a common failure mode; a hard cap prevents runaway cost and angry prospects.
The inbound audio path (WebSocket → Universal-3 Pro Streaming → LLM → TTS → back to Twilio) is identical to any other phone-based voice agent. The outbound piece is the dialer, the compliance gate, and the disposition logic.
For a full runnable implementation — dialer.py with compliance gating, server.py with the four sales tools (book_meeting, mark_callback, mark_not_interested, honor_dnc), and automatic disposition writing — clone the companion repo:
git clone https://github.com/kelsey-aai/ai-cold-calling-agentThe repo ships with a sample leads.csv, a stubbed compliance layer, and a --dry-run mode so you can verify the pipeline before dialing a real number.
Measuring an AI cold-calling agent
A cold-calling agent lives or dies by four numbers:
The end-to-end number — meetings booked per 1,000 dials — is what determines whether the agent is ROI-positive. Track each stage independently so you know where to iterate.
Two qualitative signals also matter:
- Transcript read-through: spend an hour a week reading transcripts. You'll find LLM failures you never catch in aggregate metrics.
- Prospect complaints: any complaint is a leading indicator of a future regulatory issue. Take them seriously, even when "only one."
Conversation intelligence on your call corpus is the fastest way to spot which prompt changes actually moved book rate vs. which just changed the vibe.
Compliance: the part most teams underweight
The single fastest way to kill an AI cold-calling agent program is a TCPA class action. A few non-negotiables:
- Scrub DNC before every call, not just at list ingest
- Disclose AI clearly in the opener (several states now require it; California SB 243 and others are tightening)
- Honor "take me off the list" immediately and permanently
- Respect state-level outbound calling windows — TCPA's federal baseline is 8am–9pm local time, but several states are stricter
- Record and retain evidence of consent for any B2C call
- Don't spoof caller ID — use owned numbers with a local presence strategy, not fake ones
When in doubt, B2B calls to work phone numbers generally have more latitude than B2C calls to mobiles. Still, assume every call is a compliance event and log accordingly.
AI cold-calling agent vs. AI SDR vs. traditional dialer
AI cold-calling agents don't replace human SDRs at the top of the market. They replace the bottom half of the dial list — the part a human SDR would never get to — and scale qualification in a way email cadences can't.
Closing thoughts
An AI cold-calling agent is a phone-based voice agent with a sales prompt, a dialer, and a compliance layer strapped on. The hard part isn't the LLM or the TTS — it's the speech-to-text layer that decides whether the agent hears objections accurately enough to respond well, and the operational layer that keeps you out of TCPA trouble.
Don't ship one without reading your own transcripts. Don't ship one without DNC scrubbing. Don't ship one with a speech-to-text model that was trained on podcast audio, not phone audio.
The fastest way to find out if an AI cold-calling agent will work for your motion is to build a small one against 500 leads, read every transcript, and measure the book rate. Universal-3 Pro Streaming is the reference streaming speech-to-text layer we'd recommend starting with — low latency, accurate on phone audio, unlimited concurrency, and $0.15/hour. The companion GitHub repo at github.com/kelsey-aai/ai-cold-calling-agent is the full working implementation — fork it, drop in your lead list, and run python dialer.py --dry-run first.
Frequently asked questions
What is an AI cold-calling agent?
An AI cold-calling agent is an outbound Voice AI system that places phone calls, conducts a natural spoken conversation with the prospect using a streaming speech-to-text model and a Large Language Model, handles objections, and either books a qualified meeting or marks the lead as not interested — without a human on the line. It's different from a robocall because it holds a real two-way conversation, and different from an AI SDR email tool because it works over the phone.
How does an AI cold-calling agent work?
An AI cold-calling agent works by dialing a prospect through a telephony provider like Twilio, streaming the prospect's voice into a real-time speech-to-text model, passing transcripts to an LLM that follows a sales prompt with objection-handling logic, and speaking replies back through a text-to-speech model. The full loop runs in under 800ms per turn, which is what makes the conversation feel natural instead of robotic.
What is the best speech-to-text for an AI cold-calling agent?
The best speech-to-text for an AI cold-calling agent is a streaming model with sub-300ms latency, native 8kHz mulaw support, and high accuracy on alphanumerics like emails and phone numbers. AssemblyAI's Universal-3 Pro Streaming model is purpose-built for voice agents, with 307ms median latency, immutable transcripts, intelligent endpointing, and 21% fewer alphanumeric errors than the previous streaming generation.
Is it legal to use an AI cold-calling agent?
Using an AI cold-calling agent is legal in most jurisdictions when you follow TCPA requirements in the US, GDPR in the EU, and state-level rules — meaning you scrub the federal and state Do Not Call registries before every call, disclose that the caller is an AI (required in California, Florida, Texas, and a growing list of states), honor opt-out requests immediately, and respect calling-hour windows. B2B calls to work numbers generally have more latitude than B2C calls to mobiles, but compliance filtering should be a hard gate regardless.
How much does it cost to run an AI cold-calling agent?
An AI cold-calling agent typically costs between $0.50 and $2.00 per conversation end-to-end at scale. The components are telephony (Twilio per-minute outbound voice), streaming speech-to-text (AssemblyAI Universal-3 Pro Streaming is $0.15/hour of session time), the LLM (varies by model and tokens), and text-to-speech (per-character or per-minute). At 10,000 calls/month the economics are roughly one-tenth the cost of an equivalent human SDR seat.
How do I build an AI cold-calling agent?
To build an AI cold-calling agent, combine a telephony provider (Twilio Voice, SIP, or a managed platform like Vapi or Retell) with a streaming speech-to-text model like Universal-3 Pro Streaming, an LLM with a cold-calling prompt that includes opener, discovery, objection handling, and booking logic, and a text-to-speech model. Wrap it with a dialer that enforces DNC scrubbing, calling-hour rules, and CRM disposition sync — those operational pieces are what separate a working program from a compliance incident.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.


