Shaping Hermes: a private, local AI assistant

An RTX 3060 found through our own second-hand price tracker, picked up in Elche, and given a new job: our own AI assistant running entirely on-prem -- no cloud, no API bills, with access to our infrastructure and Odoo, and a web chat the whole team can use.

What if the assistant lived in the rack, not in the cloud?

We use agentic AI every day to build and run our systems. But the most capable assistants live in someone else's cloud, behind an API meter, and they see whatever we send them. For our own infrastructure and our clients' data, we wanted something different: an assistant that runs entirely on our own hardware, sees only what we let it see, and costs nothing per token. We called it Hermes.

This is the story of giving Hermes a shape: a graphics card with a backstory, a container, a local model, and -- the part that actually matters -- a set of hands.

A GPU, a container, and a driver dance

The card itself has a backstory. Earlier this week we shipped ltc_second_hand, a price tracker for the resale market, and it flagged a well-priced RTX 3060 in Elche. We drove down, brought it home, and gave it to Hermes. Our own tool finding our own hardware.

Hermes lives in an unprivileged container on pve2, the node we rebuilt on Proxmox 9 last week. The container has no special privileges, but it does have an NVIDIA RTX 3060 with 12 GB of memory passed through to it. Passing a consumer GPU into an unprivileged container is fiddly for two reasons. First, the NVIDIA user-space libraries inside the container must match the host's kernel driver version exactly. Second, the kernel assigns some of the GPU's device numbers dynamically, so they can change across a reboot and silently break the container's access. We pinned the driver on both sides and added a small service that re-syncs those device numbers before any container starts, so a power cut never leaves Hermes blind.
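To make the device-number problem concrete, here is a minimal Python sketch. It is not the service described above, only an illustration under assumptions: that the current numbers are read from /proc/devices on the host, and that the container config grants access with lxc.cgroup2.devices.allow lines. The function names and the sample listing are hypothetical.

```python
# Sketch only: find the major numbers the kernel handed out to the NVIDIA
# character devices and render matching allow-lines for a container config.
# SAMPLE_PROC_DEVICES is made-up text in the layout of /proc/devices.

SAMPLE_PROC_DEVICES = """\
Character devices:
  1 mem
  4 tty
195 nvidia
508 nvidia-uvm

Block devices:
  7 loop
  8 sd
"""


def parse_char_majors(proc_devices_text):
    """Return {driver name: major number} for the character-device section."""
    majors = {}
    in_char_section = False
    for line in proc_devices_text.splitlines():
        stripped = line.strip()
        if stripped == "Character devices:":
            in_char_section = True
            continue
        if stripped == "Block devices:":
            in_char_section = False
            continue
        if not in_char_section or not stripped:
            continue
        number, _, name = stripped.partition(" ")
        if number.isdigit() and name:
            majors[name.strip()] = int(number)
    return majors


def cgroup_allow_lines(majors, wanted=("nvidia", "nvidia-uvm")):
    """Render one allow-line per wanted driver that is currently loaded."""
    return [
        f"lxc.cgroup2.devices.allow: c {majors[name]}:* rwm"
        for name in wanted
        if name in majors
    ]
```

On a real host the input would be the contents of /proc/devices, read after the NVIDIA modules have loaded. A service along these lines would compare the rendered lines with what the container config already holds and rewrite them only when a number has moved.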

The brain: a local model that can call tools

We serve the model with Ollama: a 14-billion-parameter model that fits comfortably in 12 GB. The headline feature is not how it writes -- it is that it can reliably emit structured tool calls. A model that only chats is a curiosity; a model that can decide to run a command and act on the result is an assistant. We tried several candidates and learned the hard way that some of them write the tool call out as plain text in their reply instead of emitting a structured call, so nothing ever runs, which makes them useless for automation. The one we kept does it cleanly.

It is not a frontier model, and we are honest about that: it is slower and less sharp than the cloud assistants we use elsewhere. But it is ours, it is private, and for day-to-day operations it is more than enough.

Giving Hermes hands

A local brain is only useful if it can reach our world. We connected Hermes to three sets of tools through one small, uniform protocol. The first is memory: a search index over years of our own notes and decisions, so Hermes can answer "how did we do this last time?" from our actual history. The second is systems: the ability to run commands across our cluster. The third is Odoo: reading and writing records in our two production databases. We learned that small local models are far more reliable with tightly scoped, minimal tools than with one big do-everything command, so each tool does one thing and describes it plainly.
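As an illustration of what "one tool, one job" can look like, here is a minimal Python sketch of a tool registry. It is an assumption about the general pattern, not Hermes's actual code or protocol. The tool names, descriptions and stub handlers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str              # one plain sentence the model reads
    parameters: dict              # argument name -> expected Python type
    handler: Callable[..., str]


def search_notes(query):
    # Stand-in for a real search over the notes index.
    return f"(stub) top notes matching {query!r}"


def service_status(host, service):
    # Stand-in for a real, read-only status check on one host.
    return f"(stub) status of {service} on {host}"


REGISTRY = {
    tool.name: tool
    for tool in (
        Tool("search_notes", "Search our notes for a phrase.",
             {"query": str}, search_notes),
        Tool("service_status", "Report whether one service runs on one host.",
             {"host": str, "service": str}, service_status),
    )
}


def call_tool(name, arguments):
    """Run one tool, rejecting unknown tools and unexpected arguments."""
    tool = REGISTRY.get(name)
    if tool is None:
        raise KeyError(f"unknown tool: {name}")
    if set(arguments) != set(tool.parameters):
        raise ValueError(f"{name} expects exactly {sorted(tool.parameters)}")
    for key, expected_type in tool.parameters.items():
        if not isinstance(arguments[key], expected_type):
            raise TypeError(f"{name}.{key} must be {expected_type.__name__}")
    return tool.handler(**arguments)
```

Each tool takes a handful of typed arguments and nothing else, so a small model has few ways to get a call wrong, and a wrong call fails loudly instead of running something unintended.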

A chat for the whole team

Living in a terminal is fine for us, but an assistant the rest of the team can use needs a front door. We put a multi-user web chat in front of Hermes: anyone on the local network gets their own account and history, the dangerous tools are restricted to administrators, and the whole thing never leaves our network. The search-our-notes tool is open to everyone; the keys to the infrastructure are not.
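The split between open and restricted tools comes down to a deny-by-default check before any tool runs. The Python sketch below shows one way to express that rule. It is an assumption for illustration, not the web chat's real configuration, and the role and tool names are hypothetical.

```python
# Sketch only: which tools a chat user may call, by role.
OPEN_TO_ALL = frozenset({"search_notes"})
ADMIN_ONLY = frozenset({"run_command", "odoo_write"})


def allowed_tools(role):
    """Return the set of tool names a role may call."""
    if role == "admin":
        return OPEN_TO_ALL | ADMIN_ONLY
    # Any other role, including an unrecognised one, gets only the open tools.
    return OPEN_TO_ALL


def authorize(role, tool_name):
    """Raise PermissionError unless the role may call the tool."""
    if tool_name not in allowed_tools(role):
        raise PermissionError(f"role {role!r} may not call {tool_name!r}")
```

Because the check starts from an allow-list, a tool that nobody has classified yet is refused for every role, administrators included, until someone decides where it belongs.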

What we learned

A private assistant is a series of trade-offs, and naming them is the point. You trade raw capability for privacy and zero marginal cost. You trade a polished cloud product for a stack you understand end to end. And you discover that the hard part was never the model -- it was the plumbing: the driver, the device numbers, the tool boundaries, the access control. Hermes will not replace the frontier models we reach for on the hardest problems. But for the steady stream of "check this, summarise that, run that on the cluster" work, it sits quietly in the rack, asks for nothing, and tells no one.
