
Engineering
When building AI chat is actually hard (how and why we built our agents)
Anh-Tho Chuong • 6 min read
Feb 26

We shipped our first AI features in late 2025, long after many companies that started building RAG chatbots in early 2023. That sounds late, but I don't think it is. It's a product of how our category constrains us, how we think about choosing what (not) to build, and how what looks easy is often hard (and vice versa).
Let's start with the first part:
I'm very deliberate about what we build. I've seen enough features rushed to market to jump on a trend (remember NFTs?) and decided we wouldn't fall into that trap.
Just for context, we split our AI features into three distinct assistants, two of which are currently live.
Many in-product chatbots aren't very useful. They do the same thing as GPT/Claude with web search (read documentation and answer the question), but without the UX.
This is why I asked myself a question more product leaders should ask: Do we need to be building this? Or can users get the same value elsewhere?
This is especially important because you don't just build once and let it run. Once you factor in opportunity cost, ongoing maintenance and token costs, building the wrong feature can be far more expensive than the wasted engineering time alone.
But Lago has proprietary data: usage events, revenue, customers, entities… the kind of data you'd be very worried about if ChatGPT could surface it with web search. We realized our customers couldn't automate billing workflows with AI unless we built it.
But for this to be truly useful, we wanted to build true agents, not just a chatbot that returns hopefully-correct data you could've found in two clicks.
This made a big difference because building an agent that operates APIs is harder than building a chatbot, especially in billing. It requires permission systems, confirmation flows, audit logs and safeguards.
Our product is the financial backbone of our customers. It directly touches accounting, compliance and security, which means we can't operate the way many startups do: fixing edge cases only once someone complains, or ignoring known bugs that don't occur often enough.
We need to get things just right, not directionally correct. This extends to our AI agents. You don't want a billing agent accidentally refunding your biggest customer.
This is why we waited:
So let's dive into why it was so hard:
When people say AI chat is easy to build, they're talking about systems like this: A generic document chatbot.

It extracts data relevant to the prompt and uses it to enrich the response. But this isn't what we built.
The system behind our billing assistant is a three-layer stack:

This is where (as you can tell) things got challenging. We had to:
And these are just the obvious things! Add to this the fact that everything is higher-stakes in billing.
For example, we needed to ensure the agents respected RBAC (role-based access control). You don't want your finance intern with "view only" access handing out discounts via chat. So our AI agents need to check permissions before every action. We also needed to ensure customers who used multiple entities got the correct results. There were a variety of these “little” things we had to get right.
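As a rough sketch of what a per-action permission check can look like (the role names, tool names and permission table here are made up for illustration; they are not Lago's actual schema):

```python
# Illustrative RBAC gate: every tool call is checked against the caller's
# role at execution time, not just once at session start. Roles, tools and
# the permission table are hypothetical examples.

READ_ONLY_TOOLS = {"get_invoice", "list_customers"}
WRITE_TOOLS = {"apply_coupon", "void_invoice"}

ROLE_PERMISSIONS = {
    "viewer": READ_ONLY_TOOLS,                 # e.g. the "view only" intern
    "admin": READ_ONLY_TOOLS | WRITE_TOOLS,
}

def execute_tool(role: str, tool: str) -> str:
    """Refuse the call outright if the role lacks the permission."""
    if tool not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"role '{role}' may not call '{tool}'")
    return f"executed {tool}"
```

The same gate would also need to scope results by entity, so multi-entity customers only ever see their own data.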
But the biggest one was AI's biggest issue:
In most products, an AI hallucination is an inconvenience. In billing, it's a financial incident that harms trust and loses money. An agent that can void invoices, retry payments and apply discounts needs to get it right.
That's why we treat hallucination prevention in layers.
First, our Mistral agent operates under a detailed system prompt that constrains it to the tools we've explicitly defined. It can't try, adapt and retry API calls the way an OpenClaw instance might. This means it requires slightly more user hand-holding, but it also minimizes the risk of catastrophic results.
Second, we built guardrails before any consequential operation (create, update, delete, void, retry, refresh). For these, the agent must show the user a preview of exactly what it's about to do and wait for an explicit "yes." There's no "always allow," and you can't turn this off.
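A minimal sketch of that confirm-before-write loop (the operation verbs match the list above, but the function shape and names are assumptions, not Lago's implementation):

```python
# Hedged sketch: consequential operations must show a preview and receive
# an explicit "yes" before running. There is deliberately no "always allow"
# flag anywhere in this flow.

CONSEQUENTIAL_VERBS = {"create", "update", "delete", "void", "retry", "refresh"}

def run_tool(tool: str, target: str, confirm) -> str:
    """`confirm` shows a preview to the user and returns True only on an
    explicit yes; anything else cancels the operation."""
    verb = tool.split("_")[0]
    if verb in CONSEQUENTIAL_VERBS:
        preview = f"About to run {tool} on {target}. Proceed?"
        if not confirm(preview):
            return "cancelled"
    return f"{tool} applied to {target}"
```

Read-only tools pass straight through; anything destructive stops at the preview unless the user explicitly approves.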
We've also intentionally not built some tools. Organization management, API key settings, webhook setup and similar things can only be done manually.
The prompt also went through many, many iterations. We ran thousands of queries to stress-test it and find gaps. Some versions were too permissive (they allowed things they shouldn't) or too long (they ran out of context too quickly).
One of our biggest learnings here is that even though AI is extremely powerful, you might not want to let users do whatever they want.
One of the weirder decisions we made was building three AI-powered chat interfaces, not one. We made this choice for a few reasons:
Lago is cross-functional: it's used by engineers, product/growth, finance and ops people. Depending on who's using it, the output they want is different. Finance wants data, or to find a specific invoice. Product/growth cares more about shipping their new pricing experiment quickly.
This is why selecting the right assistant already influences the outputs.

Imagine a product leader asks "what if we raised all prices by 20%?" If they're talking to the pricing assistant, they get strategic advice. If they're talking to a general-purpose agent with billing access, it might interpret this as an instruction to raise prices and execute the change.
By separating assistants, we create guardrails. The billing assistant can execute actions but only within billing. The finance assistant can query but can't modify. The pricing assistant only advises.
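That separation can be expressed as a simple capability table per assistant. Only the three assistant roles come from the post; the tool lists below are illustrative assumptions:

```python
# Illustrative capability scoping: each assistant only ever sees the tools
# it is allowed to use, so a misread prompt can't cross a boundary.
# Tool names are hypothetical.

ASSISTANT_TOOLS = {
    # Can execute actions, but only within billing.
    "billing": {"get_invoice", "void_invoice", "retry_payment", "apply_coupon"},
    # Can query, but never modify.
    "finance": {"get_invoice", "query_revenue", "list_customers"},
    # Advises only; no API tools at all.
    "pricing": set(),
}

def can_call(assistant: str, tool: str) -> bool:
    """A tool outside the assistant's scope simply doesn't exist for it."""
    return tool in ASSISTANT_TOOLS.get(assistant, set())
```

With this shape, the "raise all prices by 20%" question is safe by construction: the pricing assistant has no tool that could execute the change.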
It's easy to talk about what went well and how we solved things, but a lot went wrong along the way. Let me walk through a few examples.
The initial motivation wasn't a customer request, so scoping was hard. Billing is infrastructure that's often bought against a checklist: customers evaluate billing systems in large part on the exact features you support. That means many of the features we build come directly from customers and prospects, which means we already know the spec.
Because we didn't start from a specific workflow, the scope kept expanding. We began with a handful of invoice tools, then kept adding. Today we have 52 tools. That's a lot.
Every tool you add makes the agent harder to control, requiring more precise instructions.
Looking back, I'd start by working with customers to see which workflows take the longest, build tools for those, and soft-launch them to those design partners.
Prompt engineering was its own project. Since our team is very engineer-driven, we expected the technical part to be the most difficult. But writing the system prompt was hard. Early versions had security gaps where the agent would sometimes execute actions without waiting for confirmation. Other versions were so detailed they burned through the context window before the conversation got anywhere useful.
None of these mistakes were fatal. But they cost us time and focus, and I think being honest about them is more useful than pretending the process was smooth.