How we built Maria: our AI VoIP agent in one day

SIP trunk, LiveKit self-hosted, Deepgram, Claude and ElevenLabs. One day, one phone number, one AI that answers.

We wanted a customer to be able to call us and get immediate help, even outside office hours. Today we did it. In one day we took a phone number, a SIP trunk, a SIP stack and a few AI APIs, and stitched them into an agent that can hold a real conversation, take calls from anywhere in the world, and hand us back a structured summary. Her name is Maria.

[Figure: chat screenshot with project answers. Real chat snippet from the day we scoped Maria.]

The challenge

We did not want "a bot". We wanted something a customer would actually keep talking to. The requirements were honest and unforgiving:

Bilingual Spanish/English, with natural prosody, not a robot voice.
GDPR-compliant: the caller must be told they are talking to an AI, calls must have a retention policy, and personal data must land in a place we control.
Low latency. Under a second of perceived delay or the illusion collapses.
Cost-aware. If each minute costs more than a human receptionist, we have built an expensive toy, not a product.
CRM-native. Every call should arrive in our Odoo as a lead, with transcript and summary, ready for a human to follow up.

Stack and decisions

We built on open source where we could, and on paid APIs where quality still matters more than code. The architecture:

PSTN caller
    |
    v
[ Zadarma SIP trunk ]   DID +34 868 35 37 57
    |
    v                   (UDP 5060 + RTP 10000-10999)
[ Home router NAT ]
    |
    v
[ LXC CT-140 on Proxmox ]
    |
    +-- livekit-sip 1.2.0 ---+
    |                        |
    +-- livekit-server 1.11  |---> WebRTC room
    |                        |
    +-- Redis 7 -------------+
    |
    +-- Python agent (livekit-agents 1.5.4)
            |
            +-- STT : Deepgram Nova-3 (multilang, keyterm boost)
            +-- LLM : Claude Haiku 4.5 (turn) + Sonnet 4.6 (summary)
            +-- TTS : ElevenLabs Flash v2.5 (Spanish voice)
            |
            v
       Odoo CRM (crm.lead + transcript + audio)

Why those pieces:

SIP trunk: Zadarma. Cheap, real Spanish DIDs, and a proper API. Paying for a full-fat provider for an R&D line made no sense.
Telephony cloud: LiveKit self-hosted. livekit-server 1.11.0 plus livekit-sip 1.2.0, built from source. It gives us the SIP gateway, the WebRTC bridge and the agent runtime in a single coherent surface, and it runs inside our own LXC.
STT: Deepgram Nova-3. Multilingual out of the box, and crucially it supports keyterm boosting, so domain acronyms like OCA, ERP or ICT stop being transcribed as random nouns.
LLM: Claude Haiku 4.5 for every conversational turn, Sonnet 4.6 for the post-call summary. Haiku is fast enough for real-time dialogue; Sonnet does the heavier lifting only once per call, when nobody is waiting.
TTS: ElevenLabs Flash v2.5. Spanish voice uQw4jpKzMLrZuo0RLPS9. Low latency, and it does not sound synthetic by the third sentence.
Infra: Proxmox LXC, Python 3.11, Redis. No Kubernetes, no managed services. We know how to fix everything we run.
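
As a rough illustration of the split described above, the stack can be written down as a small configuration object with one helper that picks the LLM per task. This is a plain-Python sketch, not the livekit-agents API: the class name, the `llm_for` helper and the task names are hypothetical, and the model strings are the labels used in this post rather than real API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical description of the voice pipeline (labels, not API model IDs)."""
    stt: str = "Deepgram Nova-3"
    turn_llm: str = "Claude Haiku 4.5"      # answers while the caller is waiting
    summary_llm: str = "Claude Sonnet 4.6"  # runs once, after the call has ended
    tts: str = "ElevenLabs Flash v2.5"
    keyterms: tuple[str, ...] = ("OCA", "ERP", "ICT")  # acronyms to boost in STT

    def llm_for(self, task: str) -> str:
        """Return the model label for a task: 'turn' or 'summary'."""
        if task == "turn":
            return self.turn_llm
        if task == "summary":
            return self.summary_llm
        raise ValueError(f"unknown task: {task!r}")


pipeline = Pipeline()
print(pipeline.llm_for("turn"))     # Claude Haiku 4.5
print(pipeline.llm_for("summary"))  # Claude Sonnet 4.6
```

Keeping the choice in one place makes it cheap to swap either model later without touching the call-handling code.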

The three NAT/Zadarma gotchas

If you try to put livekit-sip behind a consumer NAT with a Zadarma trunk, these three things will eat your afternoon. Writing them down so the next engineer does not lose the same hours.

status=486 reason="flood" does not mean you got rate-limited. In the livekit-sip source, "flood" is a generic label for "no trunk matched this INVITE and the policy is Drop". If you create a trunk with numbers: ["+34868353757"] but Zadarma forwards the call with SIP URI sip:ai@tu_ip_wan:5060 (tu_ip_wan being a placeholder for your public WAN IP), the To-user is ai, not the phone number, and nothing matches. Fix: filter the trunk by source IP range (Zadarma's SIP edge lives in 185.45.152.0/22), not by number.
NAT-1-to-1 is not optional behind a home router. livekit-sip must advertise the public IP in the SIP Contact header and in the SDP media address. Otherwise Zadarma sends the ACK to your LAN IP, the ACK never reaches you, livekit-sip logs "Call accepted, but no ACK received" and kills the session after 10 seconds. The caller only hears ringtone. Fix:
```
nat_1_to_1_ip: tu_ip_wan
media_nat_1_to_1_ip: tu_ip_wan
```
Do not combine these with use_external_ip: true; the two options are mutually exclusive.
Do not rely on the To-user to identify the trunk. Zadarma keeps the URI user you configured in its panel, not the number the caller dialled. Zadarma also retries aggressively on any 4xx/5xx, so while you are debugging, expect a small storm of INVITEs. livekit-sip deduplicates them, but the logs are noisy.
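
The matching problem behind the first and third gotchas, and the reachability problem behind the second, can be sketched in a few lines of standard-library Python. This is only an illustration of the logic, not livekit-sip's actual code: the function names are hypothetical, and the IP range is the one quoted above, which should be checked against Zadarma's current documentation.

```python
import ipaddress

# Range quoted in the text for Zadarma's SIP edge (verify before relying on it).
ZADARMA_EDGE = ipaddress.ip_network("185.45.152.0/22")


def match_by_number(to_user: str, trunk_numbers: list[str]) -> bool:
    """Number-based matching: compare the INVITE's To-user with the trunk numbers."""
    return to_user in trunk_numbers


def match_by_source_ip(source_ip: str, allowed=ZADARMA_EDGE) -> bool:
    """IP-based matching: accept the INVITE if it comes from the provider's range."""
    return ipaddress.ip_address(source_ip) in allowed


def is_publicly_routable(advertised_ip: str) -> bool:
    """True if an address advertised in Contact/SDP is reachable from the internet."""
    return ipaddress.ip_address(advertised_ip).is_global


# An INVITE as Zadarma forwards it: the To-user is "ai", not the dialled number.
invite = {"to_user": "ai", "source_ip": "185.45.153.20"}

print(match_by_number(invite["to_user"], ["+34868353757"]))  # False
print(match_by_source_ip(invite["source_ip"]))               # True
print(is_publicly_routable("192.168.1.50"))                  # False
```

The last line is the NAT problem in miniature: a LAN address in the Contact header or SDP gives the provider nowhere to send the ACK or the media.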

GDPR and privacy

An AI picking up the phone raises obvious questions, and we answer them up front:

The first sentence Maria says is an explicit disclosure that the caller is speaking with an AI assistant, and that the call is being recorded and transcribed for service purposes.
Retention: audio and transcripts are kept for 90 days, then purged. The derived CRM lead stays, because that is a legitimate business record.
Storage lives on our own infrastructure. Third-party APIs (Deepgram, ElevenLabs, Anthropic) are used under their data processing agreements, with no model training on our traffic.
There is a human-transfer path. If the caller asks for a person, or if the model is uncertain, the call routes to Fernando's mobile.
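
The 90-day retention rule is simple enough to sketch. The snippet below only decides which recordings are due for purging; it assumes a hypothetical mapping from file name to recording timestamp and does not delete anything itself. The function names are illustrative, not taken from our codebase.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention period stated in the policy above


def expired(recorded_at: datetime, now: datetime,
            retention: timedelta = RETENTION) -> bool:
    """A recording expires once it is strictly older than the retention period."""
    return now - recorded_at > retention


def select_for_purge(recordings: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of the recordings that should be purged, sorted by name."""
    return sorted(name for name, ts in recordings.items() if expired(ts, now))


now = datetime(2026, 4, 20, tzinfo=timezone.utc)
recordings = {
    "call-a.wav": datetime(2026, 1, 1, tzinfo=timezone.utc),  # 109 days old
    "call-b.wav": datetime(2026, 3, 1, tzinfo=timezone.utc),  # 50 days old
}
print(select_for_purge(recordings, now))  # ['call-a.wav']
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the rule easy to test.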

CRM integration

This is the part we care about most, because a voice agent that does not feed the CRM is just a toy. Every call ends with:

A new crm.lead in Odoo, author tagged as "Maria", with the caller's phone number (if CLI is present) and the detected language.
The full transcript attached, turn by turn.
A short summary generated by Sonnet 4.6: what the caller wanted, what we promised, what action we owe.
A link to the raw audio, stored on our NAS and expiring in 90 days.
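
A minimal sketch of how the items above could be pushed to Odoo through its external XML-RPC API. It is a simplification under stated assumptions: the transcript and audio link go into the lead's description rather than a separate attachment, the helper names and the lead title are hypothetical, and `create_lead` expects an already authenticated `uid`. Only standard crm.lead fields (name, type, phone, description) are used.

```python
import xmlrpc.client


def build_lead_vals(caller_phone, language, summary, turns, audio_url):
    """Assemble the values for a new crm.lead from one finished call."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    vals = {
        "name": f"Call handled by Maria ({language})",  # hypothetical title format
        "type": "lead",
        "description": (
            f"Summary:\n{summary}\n\nAudio: {audio_url}\n\nTranscript:\n{transcript}"
        ),
    }
    if caller_phone:  # the caller ID (CLI) may be withheld
        vals["phone"] = caller_phone
    return vals


def create_lead(url, db, uid, password, vals):
    """Create the lead via Odoo's XML-RPC object endpoint and return its id."""
    models = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/object")
    return models.execute_kw(db, uid, password, "crm.lead", "create", [vals])


vals = build_lead_vals(
    caller_phone=None,
    language="es",
    summary="Caller wants a callback about an ERP migration.",
    turns=[("caller", "Hola"), ("Maria", "Hola, soy Maria")],
    audio_url="https://nas.example/calls/0001.wav",  # placeholder URL
)
print(sorted(vals))  # ['description', 'name', 'type']
```

Separating `build_lead_vals` from the network call means the payload can be checked without a running Odoo instance.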

From the sales team's side, nothing changes. The lead appears in the same pipeline as a web form submission or an inbound email, and they follow up the same way.

Real costs, honest numbers

Everyone asks. Approximate variable cost per minute of conversation, at today's prices:

Zadarma inbound: a fraction of a cent per minute on the Spanish DID.
Deepgram Nova-3: around 0.4 cents per minute of audio.
ElevenLabs Flash v2.5: a few cents per minute of generated speech (depends on talk ratio).
Claude Haiku 4.5: a small number of tokens per turn, so a 3-minute call costs well under a cent.
Sonnet 4.6 summary: one API request per conversation, a few cents.

Round numbers: a 3-minute call sits comfortably under 0.15 EUR of variable cost. That is our benchmark. The fixed cost of the LXC host is negligible compared to what we were paying for the same traffic through a traditional call center.
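
To make the arithmetic concrete, here is a small calculator. Only the Deepgram rate is quoted above; every other figure is an illustrative assumption chosen to fit the ranges in the list ("a fraction of a cent", "a few cents"), and all amounts are treated as euro cents with no currency conversion. Substitute your own provider prices before drawing conclusions.

```python
# Rates in cents per minute of call time.
PER_MINUTE = {
    "zadarma_inbound": 0.5,  # assumption: "a fraction of a cent"
    "deepgram_nova3": 0.4,   # figure quoted in the text
    "haiku_turns": 0.2,      # assumption: keeps a 3-minute call under a cent
}
ELEVENLABS_PER_SPOKEN_MINUTE = 3.0  # assumption: "a few cents"
TALK_RATIO = 0.4                    # assumption: share of the call Maria is speaking
SONNET_SUMMARY_PER_CALL = 3.0       # assumption: "a few cents", once per call


def call_cost_cents(minutes: float) -> float:
    """Estimated variable cost of one call, in cents."""
    metered = sum(PER_MINUTE.values()) * minutes
    tts = ELEVENLABS_PER_SPOKEN_MINUTE * TALK_RATIO * minutes
    return metered + tts + SONNET_SUMMARY_PER_CALL


cost = call_cost_cents(3)
print(f"{cost:.1f} cents")  # 9.9 cents
print(cost < 15)            # True: under the 0.15 EUR benchmark
```

With these assumed rates, text-to-speech and the summary dominate the bill, which is why the talk ratio matters more than the STT price.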

Demo

Call us at +34 868 35 37 57 and talk to Maria. Speak to her as you would to a person. Ask her what Lemon Tree Cloud does, ask for a callback, or just test her. She will tell you she is an AI, and she will log the call in our CRM so we can follow up.

Closing

The code lives in our internal GitLab. We are not open-sourcing it yet, because half of it is still held together with very opinionated YAML. But if you want us to help you build something similar for your own business, a voice agent that actually talks to your ERP instead of screaming into Slack, get in touch. We know the pain points now.

Fernando & Claude, 2026-04-20.
