The single most expensive bug at every company I've worked at was the same bug.

It's 4pm on a Wednesday. The CEO is on a call with an investor. The investor asks how many active users the company has. The CEO confidently quotes a number from the dashboard. Two hours later, the head of product mentions a different number in a different meeting. The next morning, the CFO sends a third number in the board update.

Nobody is lying. Each of them is reading a real, correct number from a real, correct dashboard. They just don't agree on what "active user" means, and the data infrastructure has no way to make them.

This is the problem we set out to fix when we started Matriq. Not "make SQL easier." Not "build ChatGPT for databases." The thing we wanted to fix is: every team in every company is quietly running on three different definitions of the same word, and the data tools are fine with it.

This post is the story of how we tried to fix that, what didn't work, and what eventually did.

Attempt one: just use a semantic layer

The first thing anyone tells you when you describe this problem is "you need a semantic layer." And they're not wrong, kind of.

A semantic layer is a YAML file (or a fancier UI sitting on top of one) where you write down, once, what every business term means. active_user is defined here. revenue is defined there. Every dashboard pulls from the layer instead of writing its own SQL. In theory, everyone agrees on definitions because the definitions live in one place.

In practice, three things go wrong.

First, the YAML file becomes its own load-bearing thing. Now you need someone who understands the semantic layer well enough to update it. That's a specialized skill. The person who understood the spreadsheet probably can't update the YAML. So updates go through a bottleneck — the data engineer — and the bottleneck is exactly what you were trying to remove.

Second, the file rots. The week after launch, marketing decides "active user" should exclude internal accounts. They mention this in a meeting. They don't update the YAML, because updating the YAML requires a pull request and they don't write pull requests. Six months later, the YAML and the conversational definition have drifted apart, and the dashboards built on the YAML are quietly correct-by-the-old-rules and wrong-by-the-new-rules.

Third — and this is the one that actually killed it for us — the file doesn't know what it doesn't know. If a definition is missing, the semantic layer can't tell you. It can only tell you about the things someone already wrote down. Most of the actual business knowledge is in nobody's head individually and in every Slack thread collectively.

We tried for about three weeks to make a semantic layer work as the primary memory of the agent. It didn't. We could feel ourselves rebuilding dbt with extra steps.

What we wanted instead

The brief I kept writing on whiteboards was something like:

The agent should learn what your business means by being used. Not by being configured. Not by being trained. By being used.

This is harder than it sounds, and we got it wrong twice before getting it sort of right.

The constraints we set ourselves:

  1. The agent learns from real usage, not from a config file someone has to maintain.
  2. Corrections outlive the chat they happened in, and when a definition changes, the newest one wins while historical questions still see historical meanings.
  3. Nothing learned in one workspace can ever surface in another.

These constraints sound mild on a slide. Together, they ruled out about 80% of the easy designs.

Attempt two: stuff it all in a vector store

The obvious next move, when you're sitting in 2026 and someone says "memory," is to throw everything into a vector database. Embed every conversation, every correction, every saved query, retrieve the top-k by cosine similarity at query time, and let the model figure it out.
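To make the failure modes below concrete, here is a minimal sketch of what that retrieval loop looks like. The vectors and memory texts are toy illustrations, not our actual embeddings, and a real system would call an embedding model instead of hand-writing vectors.

```python
import math

def cosine(a, b):
    # Cosine similarity: a measure of related-ness, with no notion
    # of recency, authority, or tenant.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# In the real system these were embedded conversations, corrections,
# and saved queries; here they are toy 3-d vectors.
memories = [
    {"text": "active user = logged in this month", "vec": [0.9, 0.1, 0.0]},
    {"text": "show me active users by region",     "vec": [0.8, 0.2, 0.1]},
    {"text": "revenue means ARR, not MRR",         "vec": [0.0, 0.9, 0.3]},
]

def retrieve(query_vec, k=2):
    # Top-k by similarity alone: a definition and a mere example come
    # back side by side, and nothing marks which one is authoritative.
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["text"] for m in ranked[:k]]

print(retrieve([1.0, 0.1, 0.0]))
```

Notice that the top two results for an "active user" query are a definition and an example, at nearly identical scores. Everything that went wrong next follows from that.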

We built this. It mostly worked. It also did three things that were quietly disqualifying.

It conflated definitions with examples. The vector store would happily retrieve a snippet that said "active user = logged in this month" and another that said "show me active users by region" with similar relevance. The model would sometimes pick up the example phrasing and treat it as the definition. The corrections would land somewhere, but unpredictably.

It had no time semantics. A definition from March and a contradicting definition from October would both come back from the same query, ranked by similarity. There was no way for the system to know "the October one is the current one, the March one is historical." Cosine similarity is a great measure of "related-ness." It's a terrible measure of "trueness right now."

It leaked across users. Embeddings are flat. When a user from one workspace happened to phrase a question similarly to a user from another workspace, a poorly-scoped retrieval could pull from the wrong tenant. This is a security category, not a quality category, and it was the real reason we ripped this approach out.


Attempt three: a typed memory layer with three categories

The version we have now is unsexy but it works. We store memories in three explicit categories — not implicitly via similarity, but explicitly via type — and each type has its own retrieval rules.

Category 1: Definitions. A definition is a structured object: a term, a current meaning, a list of historical meanings with date ranges, and the user who set the current meaning. When the agent sees a question containing a known term, it loads the current definition deterministically (not by similarity), and includes the historical ones in the context only if the question has a time qualifier ("how many active users did we have in February").
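A sketch of what that structured object and the deterministic lookup look like. The field names and schema here are illustrative assumptions, not Matriq's actual storage format:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Definition:
    # A typed definition memory: term, current meaning, who set it,
    # and historical meanings with date ranges.
    term: str
    current: str
    set_by: str
    history: list = field(default_factory=list)  # (start, end, meaning) tuples

definitions = {
    "active user": Definition(
        term="active user",
        current="logged in this month, excluding internal accounts",
        set_by="marketing",
        history=[(date(2025, 1, 1), date(2025, 9, 30), "logged in this month")],
    )
}

def load_definitions(question: str, has_time_qualifier: bool):
    # Deterministic: exact term match against known terms, never similarity.
    context = []
    for term, d in definitions.items():
        if term in question.lower():
            context.append(f"{term} (current): {d.current}")
            # Historical meanings enter the context only when the question
            # carries a time qualifier ("in February", "last quarter").
            if has_time_qualifier:
                for start, end, meaning in d.history:
                    context.append(f"{term} ({start}..{end}): {meaning}")
    return context
```

The point of the type is that "current" and "historical" are fields, not inferences, so the retrieval rule can be boringly deterministic.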

Category 2: Conventions. A convention is a softer rule. "Always exclude internal email domains." "Treat the EU and UK as separate regions." "When someone asks about revenue, default to ARR not MRR." These don't attach to a single term the way a definition does, but they shape every answer the agent gives. Conventions are scoped — workspace-wide, team-wide, or personal — and they compose: a personal convention can override a team one, a team convention can override a workspace one. We found this matters more than we expected, because different teams in the same company really do want different defaults.
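The scoping-and-override behavior can be sketched as a precedence merge. The scope names come from the post; the keys and values are made-up illustrations:

```python
# Narrower scopes are applied later, so they win on key collisions.
SCOPE_PRECEDENCE = ["workspace", "team", "personal"]

def resolve_conventions(conventions):
    # conventions: list of {"scope", "key", "value"} dicts.
    resolved = {}
    for scope in SCOPE_PRECEDENCE:
        for c in conventions:
            if c["scope"] == scope:
                resolved[c["key"]] = c["value"]  # override anything broader
    return resolved

conventions = [
    {"scope": "workspace", "key": "revenue_default",  "value": "ARR"},
    {"scope": "workspace", "key": "exclude_internal", "value": True},
    {"scope": "team",      "key": "revenue_default",  "value": "MRR"},
]
print(resolve_conventions(conventions))
```

Here the team's MRR default overrides the workspace-wide ARR default, while the workspace's exclude-internal rule still applies because no narrower scope touched it.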

Category 3: Examples. Concrete past Q&A pairs. These are the only ones we still embed for similarity retrieval. They're useful as few-shot context for the model, but they are explicitly labeled as examples — the model knows it should not treat an example as authoritative. This sounds like a small thing. It's not. It's the thing that fixed the "model picks up phrasing from a sample query and thinks it's a definition" bug from attempt two.
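The "explicitly labeled" part is mostly prompt construction. A minimal sketch of the idea, with invented section headers and a placeholder query:

```python
def build_context(definitions, examples):
    # Definitions and examples go into the prompt under different labels,
    # so the model cannot mistake a past query's phrasing for a definition.
    lines = ["## Definitions (authoritative)"]
    lines += [f"- {d}" for d in definitions]
    lines.append("## Past examples (illustrative only, NOT definitions)")
    for question, sql in examples:
        lines.append(f"- Q: {question}\n  A: {sql}")
    return "\n".join(lines)

context = build_context(
    ["active user: logged in this month, excluding internal accounts"],
    [("show me active users by region", "SELECT region, COUNT(*) ...")],
)
print(context)
```

The labels are doing the work that cosine similarity couldn't: they carry authority, not just relatedness.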

The combination of these three — deterministic for definitions, scoped-and-composed for conventions, similarity-retrieved for examples — is what we currently mean when we say "Matriq Memory." It is much less magical than it sounds in the marketing copy and much more useful than the magical version was.

What we got wrong on the way

Two honest misses, because every "how we built it" post that doesn't have these is hiding something.

Miss 1: We optimized for first-day delight before reliability. Early on, we prioritized making the agent feel smart on the first query. That meant aggressive auto-inferring of definitions when the user hadn't taught us anything yet. This was great for demos and bad for users, because the auto-inferred definition would silently anchor everything that came after. We learned to make the agent ask on the first query and assume only after explicit confirmation. Demos got slightly worse. Reliability got dramatically better. The right trade.

Miss 2: We built memory before we built memory deletion. It is shockingly easy to add a feature that lets users "teach" the agent something. It is shockingly hard to add the feature that lets users un-teach it cleanly, especially when the wrong teaching has already cascaded through saved reports, scheduled jobs, and other users' answers. We had to retrofit a deletion model six weeks in, and it would have been a third as much work to design it from the start.
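What "design the deletion path first" means in practice, sketched under one assumption the post implies: every derived artifact records which memories it depends on. The identifiers and structure here are hypothetical, not our implementation:

```python
# Tombstone-style deletion: un-teaching a memory must also surface every
# saved report, scheduled job, and answer that was built on top of it.
memories = {"mem-42": {"text": "active user = any login ever", "deleted": False}}
artifacts = [
    {"id": "report-weekly-actives", "depends_on": ["mem-42"]},
    {"id": "job-monday-digest",     "depends_on": ["mem-42"]},
    {"id": "report-revenue",        "depends_on": []},
]

def delete_memory(mem_id):
    # Tombstone rather than hard-delete, so historical answers remain
    # explainable, then return every artifact needing re-confirmation.
    memories[mem_id]["deleted"] = True
    return [a["id"] for a in artifacts if mem_id in a["depends_on"]]

print(delete_memory("mem-42"))
```

The dependency edges are the expensive part to retrofit. If artifacts never recorded what they were built on, the cascade has to be reconstructed by hand.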

If you're building something in this space, design the deletion path first. Future-you will send past-you a thank-you note.

Why this matters for you

Most of you reading this aren't going to build an agent. You're going to use one. So the only thing you actually need from this post is a way to evaluate the agents you're considering.

Here are the three questions I'd ask any AI data tool:

  1. When I correct you, where does the correction live? If the answer is "in this chat," that's a chatbot. If the answer is "in a config file you have to update," that's a semantic layer. If the answer is "in a typed memory that's loaded automatically the next time anyone asks," that's an agent.
  2. If the meaning of a term changed last month, what does a query from two months ago see? If the answer is "the new meaning, oops" — be careful. That tool is going to silently break your historical reporting.
  3. Can you delete a memory cleanly? If the answer is hand-wavy, you're going to live with the consequences of your first month's worth of corrections forever.

You can try Matriq here — we onboard small teams personally and the setup is ~6 minutes. You can also see the memory before/after on our homepage if you want a concrete picture before talking to a human.

We built this because the bug at the top of this post — the CEO, the head of product, and the CFO each quoting a different "active user" number — is a bug I have personally caused at least four times. It is solvable. It just isn't solvable with a chatbot.