There is a particular kind of AI project that everyone wants to do and almost nobody does well: the internal "ask anything" knowledge assistant. The pitch is identical wherever you hear it. Connect the company's documents. Add an LLM. Now anyone can ask a question and get a synthesised, cited answer drawn from years of accumulated institutional knowledge. It will save hundreds of hours a year. It will onboard new hires faster. It will reduce the load on senior staff who are constantly being asked the same questions.
Everything in that pitch is achievable. Most implementations ship a thin version of it, get used for a month, and are then quietly no longer trusted. What separates the working ones from the abandoned ones is almost entirely operational discipline: what gets ingested, who can read what, when answers cite their sources, what happens when the assistant is wrong, and how you know whether it's getting better or worse over time.
This piece is a complete breakdown of the assistant we built for a 40-person professional services firm in Spain — a corporate-and-commercial law and advisory practice — with the architectural choices, the trade-offs, and the seven specific failure modes we caught and fixed before they made it to production. The firm has agreed to let us write about it on the condition that nothing identifies the client. We have lightly anonymised numbers and changed details that wouldn't affect the technical content.
This is a long piece. If you want to skim it, the architecture diagram and the seven-failure-modes section are the two things to read. The rest is justification for the choices those two describe.
01 · The firm
What they needed and why prior attempts failed
The firm has 40 people: 4 partners, 18 associates, 9 paralegals, and 9 in business operations (marketing, finance, IT, admin). Their billable rates range from €120/hr for paralegals to €620/hr for senior partners. The institutional knowledge they generate is unusually rich: research memos, client matter files, precedent libraries, regulatory updates, internal know-how documents, and prior advisory engagements whose deal structures and negotiated outcomes live only in someone's head or in a folder nobody has opened in two years.
Three concrete problems were costing them measurable money. The first was associate ramp-up time: a new associate spent roughly the first four months of their tenure asking partners for context that was, in principle, written down somewhere. The second was precedent rediscovery: senior staff regularly produced memos for new matters that turned out to be 70% identical to memos produced two or three years earlier for similar matters. The third was cross-practice referrals: the firm's tax practice would frequently fail to spot opportunities to bring in the corporate-finance practice on the same client, and vice versa, because nobody had a unified view of the work being done across the firm.
They had tried two prior tools. One was a SharePoint search front end with a thin LLM wrapper bolted on; it returned answers that confidently cited documents that, on inspection, said the opposite of what the assistant claimed they said. The other was a vendor SaaS knowledge product that didn't respect their permissions model — paralegals could see partner-only research, partners could see other partners' confidential matters — which made it unshippable in a regulated environment. Both had been abandoned.
We were brought in to build a third attempt. We had four months and a build budget that was, by their standards, modest. Here is what we shipped.
02 · Architecture
The end-to-end system
The diagram looks busy. Most of the boxes are not the interesting parts — they are the places where, if you skip them or do them lazily, the system fails in subtle ways. The interesting parts are the shape of the ACL handling, the answer composer's hard rules, and the eval harness. We'll go through each of those in detail.
03 · Sources
In order of how badly each one bit us
The firm had six source systems they wanted indexed. We ingested all six but with different treatment for each, because they had wildly different signal-to-noise ratios and very different permission models.
iManage (the document management system) is the obvious source and was the largest by volume — about 280,000 documents, of which we eventually ingested about 190,000 after filtering. The filtering threw out very old drafts, superseded versions, and a category of documents the firm marked as "transient" that mostly contained interim work product nobody should be searching against. iManage gave us per-document ACLs cleanly, which made the permissions architecture tractable; without that, the project would have looked very different.
The matter management and billing system was the source where we extracted the structure of the firm rather than the content of any individual matter. Who worked on what. How long matters take. Which clients the firm has. What the rate cards looked like. This source rarely produced direct answers but was load-bearing for context: when someone asks "have we worked on a deal like this before", the matter system is what lets the assistant identify candidate prior matters before the document store provides the substance.
The curated precedent library was the highest-quality source per document — these were precedents the firm had explicitly maintained as canonical examples of good output. We weighted retrieval results from this source 1.6× higher than from iManage by default, because false positives from the precedent library are much rarer than false positives from the general document store. The firm had been quietly maintaining this library for years and was thrilled to have something that finally surfaced its content.
The email archive was the trickiest source and the one where we made the most cautious call. Email contains an enormous amount of institutional knowledge that exists nowhere else — partners explaining why a deal was structured a particular way, associates writing up oral feedback from clients, etc. It also contains personal communications, draft correspondence that was never sent, internal disagreements that should not be surfaced as canonical answers, and confidentiality minefields. We indexed only the emails that were tagged to a specific matter (so the ACL came from the matter, not from the inbox), and we excluded sent-but-not-final drafts. The decision to be cautious here was correct in retrospect — the few false positives we caught in evaluation were almost all from emails that lacked context.
The internal know-how wiki was the lowest-value source we ingested, mostly because it was poorly maintained. A wiki built up over six years contained a lot of stale procedure, deprecated forms, and "we've moved on from this" content with no marker that it was deprecated. We ingested it but down-weighted it, and we set up a monthly partner review of which wiki pages were producing answers — a feedback loop that, after three months, prompted the firm to delete or update about 30% of the wiki's content. The assistant ended up improving the source it was reading from. This was an unexpected second-order benefit and probably the single most valuable thing the project produced for the firm's long-term operations.
The regulatory feeds were a small third-party data source: three commercial regulatory-update services the firm subscribed to. We treated them separately from internal content and used them only when a query was about regulatory state. Answers from this source were always presented as "according to [feed name], as of [date]" rather than as institutional knowledge.
A pattern emerged from this list that we now apply to every knowledge project: the value of a source is not the volume of content it contains; it is the volume of content it contains that has been validated by someone. The precedent library was tiny by volume and disproportionately valuable. The email archive was huge by volume and required the most aggressive filtering.
04 · Permissions
The part most projects botch
The permissions architecture is the part of a knowledge assistant that, if it's wrong, makes the entire project unshippable. One of the firm's two failed prior attempts, the vendor SaaS product, had foundered entirely on this: paralegals could see partner work, partners could see each other's confidential client matters, and the project was killed in compliance review.
The model we shipped has three principles. They are simple to state and infinitely more annoying to implement.
The first is that the ACL is enforced at retrieval time, not at answer time. The assistant never sees a document the user is not allowed to see. It cannot accidentally synthesise from a document it shouldn't have access to, because that document was filtered out of the retrieval set before the LLM was invoked. This is the only architecture that survives security review. The naïve alternative (retrieve everything, then ask the LLM to "only use documents you're allowed to use") does not work, will not work, and we will not ship it.
The second is that the user's effective ACL is computed at every query, not cached on a session basis, because permission changes (a paralegal added to a matter, a partner moved off it) need to take effect within the same hour. We compute the user's full visible-document set as a query-time filter rather than maintaining a per-user index. This adds latency, about 80–120ms per query, and it is worth every millisecond.
The third is per-source ACL inheritance, with a documented rule for each source. Documents in iManage inherit the matter's ACL. Emails inherit the matter they're tagged to. The wiki has its own role-based ACLs that we mirror. The precedent library is firm-wide readable but has explicit per-document confidential markers that are honoured. The regulatory feed is firm-wide readable. We documented the rule for each source in a single page that compliance reviews quarterly. When an ACL question arises, that page is the source of truth.
The single most important architectural property of this matrix is that it is enforced at retrieval, not at presentation. The LLM never receives a document the user couldn't open by hand. There is no "trusted prompt" or "instruction not to mention" that could leak content. Permission changes propagate within the next polling cycle (15 minutes). The audit log records, for every answer, which document IDs were retrieved on behalf of which user, so that compliance can reconstruct any answer after the fact and verify the user's effective access at the moment of retrieval.
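To make the "enforced at retrieval" property concrete, here is a minimal sketch, not the firm's production code. The `Chunk` type, the `filter_by_acl` function, and the shape of the audit record are all hypothetical, and the sketch assumes the user's visible-document set has already been computed for this query.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float


def filter_by_acl(user_id, candidates, visible_doc_ids, audit_log):
    """Drop every chunk the user cannot open, before any model sees it."""
    allowed = [c for c in candidates if c.doc_id in visible_doc_ids]
    # Record what was retrieved on whose behalf, so an answer can be
    # reconstructed later (assumed record shape).
    audit_log.append({
        "user_id": user_id,
        "doc_ids": [c.doc_id for c in allowed],
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return allowed


# Hypothetical data: this paralegal is on matter M-1 only.
candidates = [
    Chunk("M-1/spa.docx", "placeholder text", 0.9),
    Chunk("M-2/memo.docx", "placeholder text", 0.8),
]
visible = {"M-1/spa.docx"}  # recomputed per query, never cached per session
audit_log = []
allowed = filter_by_acl("paralegal-7", candidates, visible, audit_log)
```

Only `allowed` is ever passed on to the answer composer; the M-2 memo never reaches the model, so no prompt instruction is needed to keep it out of the answer.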
05 · Retrieval
Where the actual quality lives
The retrieval pipeline is the part of a RAG system where most of the actual quality comes from. The "use a vector database" part is the easy part. The hard parts are query rewriting, hybrid retrieval, reranking, and per-source weighting.
Query rewriting: when a user asks "what did we agree about earn-outs in the Pampaneira deal", the literal query is not what we should embed. We rewrite it (using a small, fast model) into a richer search representation: an expanded synonym set ("earn-out", "contingent consideration", "deferred payment"), a probable matter name, a probable document type ("share purchase agreement, term sheet, side letter"). This rewriting is itself prompt-engineered against a held-out set, and it is one of the few places where we measure individual prompt quality with eval scores.
Hybrid retrieval: we run the rewritten query through both a dense vector search (against an embedding index) and a classical BM25 keyword search, and we combine the results. Vector search is good at semantic matches and bad at exact-phrase matches; keyword search is the opposite. For legal content, where specific terms of art ("indemnification cap", "MAC clause", "hell-or-high-water") matter precisely, hybrid retrieval consistently beats either method alone. We run both in parallel with a budget of 60 candidates each and merge.
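The text does not say how the two result lists are merged, so the following is only a sketch of one simple option: a de-duplicated union of the top 60 from each retriever, leaving the final ordering to the reranker. The function name and the chunk ids are hypothetical.

```python
def merge_candidates(dense_hits, keyword_hits, budget=60):
    """Union the top `budget` hits from each retriever, de-duplicated by chunk id.

    Returns a dict mapping chunk id -> set of retrievers that found it.
    Insertion order is preserved, dense hits first.
    """
    merged = {}
    for retriever, hits in (("dense", dense_hits[:budget]),
                            ("bm25", keyword_hits[:budget])):
        for chunk_id in hits:
            merged.setdefault(chunk_id, set()).add(retriever)
    return merged


# Hypothetical ids: c3 is found by both retrievers.
merged = merge_candidates(["c1", "c2", "c3"], ["c3", "c4"])
```

With de-duplication the merged set holds at most 120 candidates; overlap between the two retrievers makes it smaller. Keeping track of which retriever found each chunk is optional, but it is cheap and useful when debugging why a document did or did not surface.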
Reranking: the merged 120-candidate set is run through a cross-encoder reranker, a smaller model that scores the relevance of each candidate to the original query directly rather than via embedding similarity. This is more expensive per candidate but operates on a much smaller set and gives a substantially better top of the ranking. We keep the top 12 after reranking.
Per-source weighting: as mentioned earlier, candidates from the precedent library are boosted 1.6×; candidates from the email archive are penalised 0.7× by default but can be boosted by the answer composer if the query is clearly about a specific matter; candidates from the wiki are penalised 0.5× because we don't trust it as much. These weights were arrived at empirically, by running the eval harness with different weight settings and selecting the configuration that produced the highest answer quality on the golden set.
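As an illustration of the weighting step, the sketch below multiplies each reranker score by its source weight and keeps the top 12. Several things here are assumptions: that weights are applied multiplicatively to non-negative reranker scores, that iManage is the 1.0 baseline, and that sources with no stated weight default to 1.0. The names are hypothetical.

```python
# Weights from the text; iManage as the 1.0 baseline is an assumption.
SOURCE_WEIGHTS = {"precedent": 1.6, "imanage": 1.0, "email": 0.7, "wiki": 0.5}


def weighted_top_k(scored, k=12):
    """scored: list of (chunk_id, source, reranker_score) tuples.

    Returns the k best (chunk_id, weighted_score) pairs, highest first.
    """
    weighted = [
        (chunk_id, score * SOURCE_WEIGHTS.get(source, 1.0))
        for chunk_id, source, score in scored
    ]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical scores: the wiki chunk has the best raw score but ranks last.
ranked = weighted_top_k([
    ("a", "wiki", 0.9),
    ("b", "precedent", 0.5),
    ("c", "email", 0.8),
])
```

In this example the precedent chunk moves from last to first once the weights are applied, which is the intended effect: a mediocre match from a validated source outranks a strong match from an unreliable one.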
Citation packaging: every chunk that survives reranking is tagged with its source document, page or section reference, and a stable URL into the underlying system (so the user can click through to the actual iManage record, the actual email thread, the actual wiki page). The answer composer is required to use these tags. An answer that cannot cite a chunk is rejected and the assistant returns "I don't know" rather than fabricating one.
06 · Hard rules
The answer composer's three commitments
We use a single off-the-shelf large model for answer composition (the model choice has changed twice during the project's lifetime, both times without architectural changes — see the maintenance discussion in the ROI piece). The architecture is in the prompt, the rules around the prompt, and the validation of the output. There are three hard rules.
Rule one: every claim in the answer must cite a chunk. The prompt is structured so that the model produces output as a sequence of (claim, citation) pairs, and a post-processing step rejects any answer that contains text not attached to a citation. If the model ignores this, the answer is dropped and a "the assistant could not produce a citation-grounded answer" message is returned to the user. This rule is non-negotiable. Violations are tracked as a quality metric and trigger a prompt revision when they exceed 1.5% of answers in a week.
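A minimal sketch of the post-processing check behind rule one, assuming the model's output has already been parsed into (claim, chunk id) pairs. The function names and the example claim are hypothetical; the refusal wording follows the text.

```python
REFUSAL = "The assistant could not produce a citation-grounded answer."


def validate_answer(pairs, retrieved_ids):
    """Return the pairs unchanged, or None if any claim lacks a valid citation.

    pairs: list of (claim_text, chunk_id or None)
    retrieved_ids: ids of the chunks that were actually given to the model
    """
    if not pairs:
        return None
    for claim, chunk_id in pairs:
        # A citation only counts if it points at a chunk the model was shown.
        if not claim.strip() or chunk_id not in retrieved_ids:
            return None
    return pairs


def render(pairs, retrieved_ids):
    valid = validate_answer(pairs, retrieved_ids)
    if valid is None:
        return REFUSAL
    return " ".join(f"{claim} [{chunk_id}]" for claim, chunk_id in valid)


grounded = render([("The earn-out runs for two years.", "c1")], {"c1"})
ungrounded = render([("The earn-out runs for two years.", None)], {"c1"})
```

Checking the citation against the retrieved set, rather than merely checking that a citation is present, also rejects answers that cite a chunk id the model made up.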
Rule two: "I don't know" is the right answer when retrieval is weak. The composer is given a confidence signal from the reranker (top result's score, distribution of scores across the top 5), and it is explicitly instructed that low-confidence retrieval requires an "I don't know" answer with a recommendation of who to ask instead. We measure the "I don't know" rate as a quality signal — too low means the model is bullshitting, too high means retrieval is failing. The healthy range, based on six months of operation at this firm, is 8–14% of queries. Anything outside that band gets investigated.
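The band check itself is simple arithmetic. A sketch, using the 8–14% range from the text as defaults; the function name and status strings are hypothetical.

```python
def idk_rate_status(idk_count, total, low=0.08, high=0.14):
    """Classify a period's "I don't know" rate against the healthy band."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = idk_count / total
    if rate < low:
        return "too_low"   # the model may be answering when it should not
    if rate > high:
        return "too_high"  # retrieval may be failing
    return "healthy"


status = idk_rate_status(idk_count=11, total=100)
```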
Rule three: confidence is categorical, not numerical. Answers are tagged "high confidence" (multiple corroborating sources, recent), "medium confidence" (single primary source or partial corroboration), or "verify with a partner" (low retrieval confidence or sensitive subject matter). The user sees the label, not the underlying score. The label affects the answer's framing — "verify with a partner" answers explicitly recommend escalation. This was a deliberate decision after the firm's compliance team found that numerical confidence scores ("87% confident") were being interpreted as much more authoritative than they should be by junior staff.
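One way to express the mapping from retrieval signals to a label is sketched below. The three labels come from the text; the inputs and the "two or more sources" cut-off are assumptions made for the example.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "high confidence"
    MEDIUM = "medium confidence"
    VERIFY = "verify with a partner"


def label_answer(corroborating_sources, is_recent,
                 low_retrieval_confidence, sensitive_topic):
    """Map retrieval signals to the categorical label shown to the user."""
    # Escalation takes priority over everything else.
    if low_retrieval_confidence or sensitive_topic:
        return Confidence.VERIFY
    # "Multiple corroborating sources, recent"; >= 2 is an assumed cut-off.
    if corroborating_sources >= 2 and is_recent:
        return Confidence.HIGH
    return Confidence.MEDIUM


label = label_answer(3, True, False, False)
```

The function returns an enum rather than a score on purpose: there is no number for the interface to display by accident.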
07 · Eval harness
The 6% of work that decides the other 94%
This is the section that almost no public write-up of a knowledge assistant covers and that, in our experience, separates the working systems from the broken ones. A knowledge assistant without an eval harness is a system whose quality you can't measure, can't improve deliberately, and can't catch regressions in. We will not ship one without it.
The harness has four parts.
The golden set is 312 questions, each with a known good answer, validated by a partner. The questions span every source, every role's typical query patterns, and every kind of answer the system should be capable of (factual lookup, multi-document synthesis, "have we done X before", "what's the right precedent for Y", "summarise this matter"). About 40 of the 312 are negative examples: questions the assistant should refuse or answer with "I don't know", because the right answer requires content outside the corpus or because the question is genuinely unanswerable. The negative set catches "the assistant got too confident" regressions, which are otherwise invisible. The golden set is reviewed and expanded quarterly.
Weekly automated runs execute the full golden set against the live system every Sunday night. For each question, the system records the retrieved chunks, the answer, the citations, and a set of derived metrics: did the answer cite the expected source documents (citation precision/recall), did the answer contain the expected key facts (we use a separate LLM-as-judge here, with a constrained rubric), what was the confidence label, was the answer in the right "I don't know" category for negative examples. The output is a single dashboard the team reviews on Monday morning.
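Citation precision and recall for a single golden question can be computed as below. This is a sketch under the usual set-based definitions; the function name and document ids are hypothetical, and scoring empty sets as 0.0 is a convention chosen for the example (negative examples are scored separately, on whether the assistant declined).

```python
def citation_precision_recall(cited, expected):
    """Compare the documents an answer cited against the expected ones.

    precision = share of cited documents that were expected
    recall    = share of expected documents that were cited
    """
    cited, expected = set(cited), set(expected)
    hits = len(cited & expected)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical ids: two of four citations are right; one expected doc is missed.
precision, recall = citation_precision_recall(
    cited=["d1", "d2", "d3", "d4"],
    expected=["d1", "d2", "d5"],
)
```

Here precision is 2/4 and recall is 2/3: the answer cited two documents it should not have needed and missed one it should have found.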
Regression detection compares each weekly run against the baseline of the prior four weeks. Any question whose quality score drops by more than a defined threshold triggers a review. About 80% of regressions turn out to be data drift (a source document was changed, a permission was altered, a wiki page was edited) and require no code change. The remaining 20% are real and require investigation.
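A sketch of the comparison against a four-week baseline. The text only says "a defined threshold", so the 0.15 default here is a placeholder, as are the function name, the 0-to-1 score scale, and the use of a simple mean as the baseline.

```python
from statistics import mean


def find_regressions(history, current, threshold=0.15):
    """Flag questions whose score fell well below their recent baseline.

    history: {question_id: [weekly scores, oldest first]}
    current: {question_id: this week's score}
    """
    flagged = []
    for question_id, score in current.items():
        prior = history.get(question_id, [])[-4:]  # up to four prior weeks
        if not prior:
            continue  # new question, nothing to compare against yet
        baseline = mean(prior)
        if baseline - score > threshold:
            flagged.append((question_id, baseline, score))
    return flagged


# Hypothetical scores: q1 drops sharply, q2 wobbles within tolerance.
history = {"q1": [0.9, 0.9, 0.9, 0.9], "q2": [0.8, 0.8, 0.8, 0.8]}
current = {"q1": 0.5, "q2": 0.78, "q3": 0.4}
flagged = find_regressions(history, current)
```

Only q1 is flagged. q3 has no history, so it is skipped rather than treated as a regression.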
Live sampling: separately from the golden set, we sample 30 random user queries each week and have a partner score them blind. This catches real-world query distributions the golden set might miss, and it keeps the partners involved in the system's quality, which matters for both governance and trust.
The whole eval harness — golden set, weekly run, dashboard, sampling — is roughly 6% of the total project effort. It is the single highest-leverage 6% of any RAG project we have ever built. Without it, you do not actually know whether your system is good or bad; you have impressions, anecdotes from users, and an undifferentiated "it feels okay" sense that tells you nothing.
08 · Failure modes
The seven we caught before launch
This is the section we promised in the title. These are the seven specific ways the system was broken or misbehaving during the build phase, all caught by the eval harness or by partner-reviewed sampling, and what we did about each.
Failure mode 1: confident answers from one document the system shouldn't have trusted. The system was happily synthesising answers from a single iManage document, which in several cases turned out to be a draft that was never finalised, or a memo from an associate that contradicted the partner's later guidance on the same matter. The retrieval system had no way to distinguish "draft" from "final" within iManage. Fix: we ingested iManage's document-status metadata and used it as a retrieval-time filter, excluding any document marked as draft by default unless the user explicitly invoked an "include drafts" command (we kept this for the small number of cases where someone wanted to find a specific draft).
Failure mode 2: stale answers about regulation. Questions about regulatory state were being answered from internal memos written 2–4 years earlier, even when the regulation in question had since changed. The system had no concept of "this content is time-sensitive". Fix: we tagged a subset of source documents with a "regulatory currency" flag (true/false). For any question the answer composer classified as regulatory, retrieval had to include at least one chunk from a source dated within the last 12 months; otherwise the system fell back to a "this is a regulatory question; here is what we found in our archive but we strongly recommend checking current law with a partner" framing.
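The recency requirement can be sketched as a small gate that runs once a question has been classified as regulatory. The function name and return values are hypothetical, and 12 months is approximated as 365 days.

```python
from datetime import date, timedelta


def regulatory_framing(chunk_dates, today, max_age_days=365):
    """Decide how to frame an answer to a regulatory question.

    chunk_dates: source dates of the retrieved chunks
    Returns "answer" if at least one source is recent enough, otherwise
    "archive_only", meaning: show what was found and recommend a partner check.
    """
    cutoff = today - timedelta(days=max_age_days)
    if any(d >= cutoff for d in chunk_dates):
        return "answer"
    return "archive_only"


# Hypothetical dates, with a fixed "today" so the example is reproducible.
today = date(2024, 6, 1)
stale = regulatory_framing([date(2021, 3, 1)], today)
fresh = regulatory_framing([date(2021, 3, 1), date(2024, 1, 15)], today)
```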
Failure mode 3: cross-matter information bleed via the email archive. A query about Client A was returning information from emails on a structurally similar matter for Client B, in cases where the matter ACL allowed the user to see both. This was technically permitted by the ACL but was a clear breach of the firm's working ethics around treating client matters as independent. Fix: we added a per-query matter context, and when a query was clearly about one specific matter (extracted by the query rewriter), retrieval was restricted to that matter's documents plus the precedent library. Cross-matter retrieval requires an explicit "search across matters" toggle.
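A sketch of the matter-scoping rule, assuming each candidate carries a `source` and a `matter_id` field and that the query rewriter has already extracted the matter (or `None` if the query is not about one). All names are hypothetical.

```python
def scope_to_matter(candidates, matter_id, search_across_matters=False):
    """Keep only the named matter's chunks plus the precedent library."""
    if matter_id is None or search_across_matters:
        return list(candidates)
    return [
        c for c in candidates
        if c["matter_id"] == matter_id or c["source"] == "precedent"
    ]


# Hypothetical candidates the user is permitted to see under the ACL.
candidates = [
    {"id": "e1", "source": "email", "matter_id": "client-a"},
    {"id": "e2", "source": "email", "matter_id": "client-b"},
    {"id": "p1", "source": "precedent", "matter_id": None},
]
scoped = scope_to_matter(candidates, "client-a")
everything = scope_to_matter(candidates, "client-a", search_across_matters=True)
```

This runs after the ACL filter, not instead of it: the ACL decides what the user may see, and the scope decides what is appropriate to draw on for this particular question.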
Failure mode 4: the wiki contradicting itself. Several pages on the wiki had been written and never updated, then later partially updated, leaving the page internally inconsistent. The assistant would happily quote a sentence from one paragraph that contradicted the next. The eval harness caught this when the same question, asked twice in slightly different phrasings, returned contradictory answers from the same wiki page. Fix: we built a "self-consistency" check on wiki pages — if a single page produced material disagreement across our paraphrased golden questions, the page was flagged for human review and excluded from indexing until a partner triaged it. About 14% of the wiki was in this state at first run.
Failure mode 5: the assistant inferring answers it shouldn't. A question like "what's our typical fee for this type of matter" was being answered by the model averaging across the matter management data and producing a number. This was wrong on two axes: first, the user asking might not have permission to see the underlying data; second, fee structures depend on context (client relationship, matter complexity, partner discretion) that the model couldn't see. Fix: we added a class of "do-not-infer" topics — fee quotes, hiring decisions, performance reviews, conflict-of-interest assessments — where the system is required to return "this requires a partner to assess" rather than producing a synthesised answer. The list of topics was defined by partners and is reviewed quarterly.
Failure mode 6: latency creep nobody noticed. Over the build phase the p95 query latency had crept from 2.1s to 5.4s as we added retrieval components. No single change caused it; each one added 300–700ms. The eval harness now records latency as a first-class metric, and any p95 over 4.5s on the weekly run triggers an investigation. We brought it back to 3.1s by parallelising retrieval calls that had been running serially.
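For reference, a p95 check of this kind fits in a few lines. The sketch uses the nearest-rank definition of a percentile, which is one of several in common use; the 4.5s budget is from the text, and the function names and sample latencies are hypothetical.

```python
P95_BUDGET_S = 4.5


def p95(latencies_s):
    """Nearest-rank 95th percentile of a list of latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), integers only
    return ordered[rank - 1]


def needs_investigation(latencies_s):
    return p95(latencies_s) > P95_BUDGET_S


# Hypothetical weekly run: most queries are fast, two are slow.
weekly_run = [2.0] * 18 + [5.0, 6.0]
flag = needs_investigation(weekly_run)
```

With 20 samples the nearest-rank p95 is the 19th value in sorted order, here 5.0s, so this run would be flagged even though 90% of its queries took 2.0s.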
Failure mode 7: feedback flowing into prompts without provenance. During the first week of beta, well-meaning testers were giving thumbs-down with rationales like "this is wrong because Maria knows", and we were quietly using those rationales to adjust the retrieval prompt. Three weeks in, we discovered the system had drifted away from a few correct answers because of feedback that turned out to be wrong on inspection. Fix: feedback is now logged with full provenance, but no automated process modifies the prompt or the retrieval ranking based on user feedback. All such changes go through engineering review with the eval harness as the gate. Feedback shapes the curator queue and the golden-set expansion; it does not directly touch the system.
09 · After launch
Nine months in production
The system has been live for nine months at the time of writing. The firm has not abandoned it — which, given their prior history with two killed projects, is the metric they care about most. The numbers, lightly anonymised:
The associate ramp-up time problem has measurably improved. New associates joining the firm in the last three months are reporting (informally and via a survey) that they ask partners 30–45% fewer "where do I find" or "what did we do for X" questions in their first three months. The partners are unanimous that this has freed real time.
The precedent rediscovery problem has been partially solved. Senior staff now check the assistant before drafting a memo for a new matter, and in roughly 40% of cases find a substantially relevant prior memo to start from. They estimate this saves between 2 and 6 billable hours per memo, on memos that previously took 8–14 hours to write from scratch.
The cross-practice referral problem has not been solved by the assistant alone, but the assistant has flagged 11 instances in the first six months where a matter handled by one practice could have been escalated to another, and the partners reported that 4 of those 11 produced new engagements that wouldn't otherwise have happened. This is a small number but a high-value one.
The single most surprising outcome was on the wiki. The system's existence created a feedback loop — pages that produced bad answers were either deleted or rewritten — that improved the underlying knowledge base independently of the assistant. This is, in retrospect, the kind of compound effect that justifies building this sort of system at all. The assistant didn't just answer questions; it raised the quality of the source material it was reading from, and that improvement persists even if the assistant itself goes away.
10 · What we'd change
Three things, in order of importance
We would build the eval harness first, before any retrieval or any prompt. We built it in the second month and we should have built it in the first week. Every architectural decision we made in the early weeks would have been better with the harness in place, because we'd have been making them with measurement instead of intuition.
We would have a much more aggressive starting policy on the wiki. We were polite about ingesting it; we should have been ruthless. Roughly 30% of the value of the project came from cleaning the wiki, and we let that take six months to happen instead of doing it in the first month.
We would set the user expectation about "I don't know" at the very first interaction, not via training later. The single most common piece of negative early feedback was "the assistant won't answer this", and the right response was "yes, that's the design, that's why you can trust the answers it does give you". We eventually published an internal note saying exactly this. We should have built it into the first-touch onboarding from day one.
This is a long write-up of one project. We've published it in this depth because almost every public write-up of a knowledge assistant skips the parts that matter — the permissions architecture, the eval harness, the failure modes. Those are the parts that make the project either ship or quietly die. The model and the prompt, by comparison, are the easiest 30% of the work.
If you have a similar project under way and want a second opinion on the architecture before you build it, we offer architecture reviews as a fixed-scope engagement. We will tell you, candidly, when we think the project shouldn't happen — about a third of the reviews we do end with that recommendation, and clients have so far thanked us for it.