"Human-in-the-loop" is the phrase that gets vendors past compliance reviews. It is also, in most implementations, a fig leaf. The system technically has a human reviewing its decisions, and that human is approving 98% of them in under three seconds because the volume is too high to do anything else, the interface only shows them a green button, and nobody has defined what a thoughtful review even looks like. The compliance team signed off, the reviewer is overwhelmed, the model is doing whatever it wants, and the company is now exposed to exactly the risks the human was supposedly preventing.
This piece is the reference architecture we use when we build AI systems that genuinely need human oversight — for regulatory reasons, for ethical reasons, for "the cost of being wrong is too high" reasons. It covers when to escalate, who the right reviewer is, what they should see, what they should not see, and the drift problem that quietly undermines every HITL system that doesn't account for it.
01 · Where they fail
The four common HITL failure modes
Almost every failed HITL implementation we've inspected fails in one of four ways, and they're worth naming because each has a different remedy.
The first failure is rubber-stamping: the reviewer is shown an output and a binary approve/reject button, with no context, no reasoning, no real expectation of doing anything other than approve. Approval rates climb to 99%+ within a month and the human's role becomes ceremonial.
The second is alert fatigue: the system escalates so much that the reviewer's queue is permanently overflowing. Reviews become triage; thoughtful review becomes impossible; the things that should have been escalated are buried under things that shouldn't have been.
The third is wrong reviewer: the person doing the review doesn't have the expertise to actually catch the kind of error the system is most likely to make. A junior support agent reviewing a model's medical-coding suggestion isn't going to spot the subtle ones; an executive reviewing a content moderation decision isn't going to want to spend the time on it. Wrong-reviewer failures are particularly insidious because the system looks like it has oversight.
The fourth, and the most subtle, is drift around the gate: the model learns, implicitly or explicitly, to phrase its outputs in the way most likely to be approved by the reviewer, regardless of whether those outputs are correct. The gate stops catching the things it was meant to catch because the model is now optimising to bypass it.
A good HITL architecture has to address all four of these, not just one or two. Most of the systems we audit address one. The reference architecture below is what addressing all four looks like.
02 · The architecture
End to end
The architecture has five things that have to be present for it to work: a gate classifier that makes a categorical routing decision, four well-defined routes (not just "approve" and "reject"), an audit log that is written for every decision regardless of route, a drift-detection layer that runs against the audit log, and a feedback loop that flows from reviewer decisions into an offline eval set rather than into the model directly. Each of these is doing real work. Removing any one of them produces one of the four failure modes from the previous section.
03 · The four routes
Why two routes are not enough
The single biggest mistake in HITL design is the binary route — auto-approve or send to a human. The right design has four routes, with explicit criteria for each.
Auto-approve is for outputs that are low-risk and high-confidence by the gate classifier's measure. The system acts directly. This is necessary, not optional: without an auto-approve path, the human queue overflows and reviewer fatigue eats the whole system. The criteria for auto-approve must be explicit, written down, reviewed by the team that owns the system, and conservative enough that the routine cases really are routine. We typically aim for 60–80% of decisions to be auto-approved in a healthy system. Below that, the reviewers can't keep up; above that, the gate is probably too lax.
Reviewer queue is for medium-risk outputs that need human judgment but not specialised expertise. Trained operations staff handle these. The crucial design choices are queue length (we target a 4-hour median time-in-queue and an 8-hour 95th-percentile; longer than that and the reviewer experience degrades into triage) and the per-item review interface, which we'll cover in detail below.
Expert escalation is for outputs that are high-risk or that touch a domain where ordinary reviewer judgment isn't enough. These go to a domain expert — a partner in a law firm, a senior clinician in a healthcare context, a senior account manager for the strategic-account stuff in a sales context — with a much longer SLA (we typically write 1 working day) and an explicit acknowledgment that the expert may push back on the system's framing of the question itself. Expert escalations should be rare; we target 2–6% of total volume. If they're more than 10% of volume, the gate is escalating things that should be in the reviewer queue, which exhausts your most expensive resource.
Hard refusal is the route that almost no HITL implementation includes, and it is the most important one. There are some inputs the system should not respond to at all, with no human review path, because the right answer is "this is outside what the system does, here is what you should do instead". A medical-advice question to a customer support assistant. A pricing-discretion question to a sales drafting tool. A confidential-personnel-matter question to a knowledge assistant. The system should refuse, explain why, and route the user elsewhere. Without a hard-refusal route, every weird input ends up in the expert queue, where it shouldn't be.
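As a minimal sketch of how the four routes and their volume targets could be written down, the snippet below encodes the routes as an enum and checks a batch of routing decisions against the bands quoted above (60–80% auto-approve, 2–6% expert escalation, with 10% as the alarm level). The names `Route`, `TARGET_SHARE`, `EXPERT_ALARM_SHARE` and `share_warnings` are hypothetical, chosen for illustration only.

```python
from collections import Counter
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    REVIEWER_QUEUE = "reviewer_queue"
    EXPERT_ESCALATION = "expert_escalation"
    HARD_REFUSAL = "hard_refusal"


# Target share of total volume per route, as (low, high) fractions.
# Only these two bands are stated in the text; the other routes have no stated band.
TARGET_SHARE = {
    Route.AUTO_APPROVE: (0.60, 0.80),
    Route.EXPERT_ESCALATION: (0.02, 0.06),
}
# Above this share, the gate is escalating work that belongs in the reviewer queue.
EXPERT_ALARM_SHARE = 0.10


def route_shares(decisions):
    """Return the fraction of decisions that went to each route."""
    total = len(decisions)
    counts = Counter(decisions)
    return {route: (counts[route] / total if total else 0.0) for route in Route}


def share_warnings(decisions):
    """List human-readable warnings for routes outside their target band."""
    shares = route_shares(decisions)
    warnings = []
    for route, (low, high) in TARGET_SHARE.items():
        share = shares[route]
        if share < low:
            warnings.append(f"{route.value} share {share:.0%} is below target {low:.0%}")
        elif share > high:
            warnings.append(f"{route.value} share {share:.0%} is above target {high:.0%}")
    if shares[Route.EXPERT_ESCALATION] > EXPERT_ALARM_SHARE:
        warnings.append("expert escalation exceeds 10% of volume: gate is over-escalating")
    return warnings
```

Keeping the bands in one reviewed constant, rather than scattered across dashboards, makes them easier to put under the same version control as the risk policy.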
04 · When to escalate
The risk × confidence routing matrix
The gate classifier's decision is a function of three inputs: the risk of the action being proposed, the model's confidence in its output, and any policy rules that override both.
A few things about this matrix that matter in practice:
Risk is set by policy, not by the model. What counts as "high risk" is a written, reviewed, version-controlled list maintained by the team that owns the system, with input from compliance and from the operations leads. It is not something the model decides per-call. This is the single biggest difference between a HITL system that works and one that doesn't: the categorisation of risk is a human-owned, slowly-changing artifact, while the match of a specific input to that categorisation is what the model does.
Confidence has to be honest. Model confidence is notoriously unreliable when self-reported, so we typically derive it from the retrieval pipeline (top-result relevance score, score distribution across the top N), from explicit "I'm uncertain because..." patterns the model is prompted to flag, and from agreement between two parallel calls (the same query run twice with slight variations and compared). Self-confidence alone is insufficient.
The high-risk + low-confidence cell is hard refusal, not expert escalation. This is counterintuitive but important. If the system doesn't know what it's doing, an expert reviewer is not going to be able to fix it from one look at the output. Better to refuse and route the user to do it themselves, manually, than to ask an expert to validate a guess. We have learned this the hard way; expert reviewers under pressure will often approve high-risk low-confidence outputs because the alternative is sending the user away empty-handed, and the consequences of those approvals tend to surface weeks later.
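The routing logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the gate classifier itself: confidence is collapsed to a yes/no signal, the `min_relevance` threshold of 0.75 is invented, and the low-risk, low-confidence cell is assumed to go to the reviewer queue. The cells the text does fix are preserved: policy overrides come first, low-risk plus high-confidence auto-approves, medium risk goes to the reviewer queue, and high-risk plus low-confidence is a hard refusal.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    REVIEWER_QUEUE = "reviewer_queue"
    EXPERT_ESCALATION = "expert_escalation"
    HARD_REFUSAL = "hard_refusal"


@dataclass(frozen=True)
class ConfidenceSignals:
    top_relevance: float        # relevance score of the top retrieved result, 0..1
    uncertainty_flagged: bool   # model emitted an "I'm uncertain because..." pattern
    parallel_calls_agree: bool  # two slightly varied runs produced the same answer


def is_confident(signals: ConfidenceSignals, min_relevance: float = 0.75) -> bool:
    """Confident only if every external signal agrees (threshold is an assumption)."""
    return (
        signals.top_relevance >= min_relevance
        and not signals.uncertainty_flagged
        and signals.parallel_calls_agree
    )


def choose_route(risk: Risk, signals: ConfidenceSignals, out_of_scope: bool = False) -> Route:
    """Map policy-assigned risk and derived confidence to one of the four routes."""
    if out_of_scope:
        # Policy rules override both risk and confidence.
        return Route.HARD_REFUSAL
    confident = is_confident(signals)
    if risk is Risk.HIGH:
        # High risk + low confidence is refused, not sent to an expert.
        return Route.EXPERT_ESCALATION if confident else Route.HARD_REFUSAL
    if risk is Risk.MEDIUM:
        return Route.REVIEWER_QUEUE
    # Low risk: auto-approve only when confident (the fallback is an assumption).
    return Route.AUTO_APPROVE if confident else Route.REVIEWER_QUEUE
```

Note that `risk` is an argument, not something computed here: it comes from the written, version-controlled policy, and the model's only job is matching an input to a category.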
05 · The reviewer view
What they see, and what they don't
The reviewer interface is where the HITL pattern most often falls apart in practice. A reviewer staring at a screen with an "approve" button and a "reject" button will approve, fast, almost always. The interface has to make thoughtful review the easier path than rubber-stamping. This is mostly an interface-design problem, but it has architectural consequences.
The interface we recommend has six elements visible in a single screen: the original input, the model's proposed output, the model's reasoning (a short summary, not a chain-of-thought transcript), the source documents or context the model used, the model's self-reported confidence and the categorical risk label, and a structured rejection path that requires the reviewer to indicate why they're rejecting — which is what flows into the offline eval set.
What the reviewer does not see is also important. They do not see other recent decisions on the same item (which would create anchoring effects). They do not see other reviewers' notes on similar items (which would create herding). They do not see the user's identity or anything that could bias the decision toward "this customer is important, approve". And critically, they do not see a default action — there is no greyed-in "approve" button that a tap of the spacebar accepts. Both options require a deliberate click, and the rejection path requires a structured reason.
The structured-rejection requirement is the single highest-leverage interface decision in the entire pattern. Approvals can be one-click; rejections cannot. This makes the marginal cost of approving slightly higher (you have to confirm) and the marginal cost of rejecting much higher (you have to articulate why). The asymmetry feels wrong at first (surely rejection should be easier?), but it is the right design, because the cost of a missed rejection is much higher than the cost of a missed approval, and the friction is doing the work of catching cases where the reviewer was about to rubber-stamp.
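One way to enforce these interface rules in the data model, rather than relying on the front end alone, is sketched below. The field names in `HIDDEN_FIELDS` and the members of `RejectionReason` are hypothetical; the real list of rejection reasons would come from the team that owns the system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RejectionReason(Enum):
    # Illustrative categories only; not a list taken from the article.
    FACTUALLY_WRONG = "factually_wrong"
    UNSUPPORTED_BY_SOURCES = "unsupported_by_sources"
    POLICY_VIOLATION = "policy_violation"
    OTHER = "other"


# Fields stripped before an item reaches the reviewer's screen (names assumed).
HIDDEN_FIELDS = {"user_identity", "prior_decisions", "other_reviewer_notes"}


def reviewer_view(item: dict) -> dict:
    """Return a copy of the item without the fields that would bias the review."""
    return {key: value for key, value in item.items() if key not in HIDDEN_FIELDS}


@dataclass(frozen=True)
class ReviewDecision:
    item_id: str
    approved: bool  # no default value: the reviewer must choose explicitly
    reason: Optional[RejectionReason] = None
    note: str = ""

    def __post_init__(self):
        if not self.approved and self.reason is None:
            raise ValueError("a rejection requires a structured reason")
        if self.approved and self.reason is not None:
            raise ValueError("an approval must not carry a rejection reason")
        if self.reason is RejectionReason.OTHER and not self.note.strip():
            raise ValueError("reason 'other' requires a free-text note")
```

Because an invalid `ReviewDecision` cannot be constructed, every rejection that reaches the offline eval set is guaranteed to carry a reason.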
06 · Drift
Three kinds, each with its own monitoring
Even a well-designed HITL system will degrade over time if you do not specifically design against drift. There are three kinds of drift to watch for, and they each need their own monitoring.
Model drift around the gate is the subtle one. If you use reviewer approvals as a training signal — fine-tuning the model on approved outputs, for example — the model will learn to produce outputs that look like the approved ones, regardless of whether they're correct. Worse, it will learn to produce outputs that are confident in the way reviewers like, which means the model's own confidence signal becomes uncalibrated relative to reality. The remedy is structural: reviewer decisions never feed back into model weights or into the live retrieval pipeline. They feed an offline eval set that engineering uses, with a deliberate human gate, when deciding whether to change the prompt or the retrieval.
Reviewer drift is the slow one. Reviewers, doing the same job day after day, develop heuristics. Some of those heuristics are good; some are wrong. Without intervention, two reviewers will gradually develop different decision boundaries on similar cases, and within a year you will have inconsistent decisions that compliance can't defend. The remedy is monthly inter-reviewer agreement checks: take 20 randomly sampled past decisions, have a second reviewer score them blind, measure the disagreement rate, and have a calibration session when it exceeds a threshold. This is annoying and unglamorous and absolutely worth doing.
Policy drift is the dangerous one. The world changes — regulations change, the company's risk tolerance changes, new failure modes are discovered — and the policy that defined "high risk" was written 18 months ago. Without active policy review, the gate is enforcing stale rules. The remedy is a quarterly policy review with the team that owns the system, plus an explicit re-categorisation pass any time a notable incident occurs. We treat this as a calendar item, not as something that happens when somebody remembers.
The audit log is what makes drift detectable at all. Every decision is logged with the input, the model output, the gate's routing decision, the reviewer's decision (if any), and the outcome. Without this log, you cannot detect drift; you can only experience its consequences once something has gone wrong publicly.
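A minimal sketch of such a log, assuming an append-only JSON Lines file with one record per decision. The `AuditRecord` field names are hypothetical; they simply mirror the five items listed above plus an identifier and a timestamp.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, TextIO


@dataclass(frozen=True)
class AuditRecord:
    decision_id: str
    logged_at: str                           # ISO 8601 timestamp, UTC
    input_text: str
    model_output: str
    route: str                               # the gate's routing decision
    reviewer_decision: Optional[str] = None  # None when no human was involved
    outcome: Optional[str] = None            # filled in once the result is known


def append_record(log: TextIO, record: AuditRecord) -> None:
    """Write one record as a single JSON line; existing lines are never edited."""
    log.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def read_records(log: TextIO) -> List[AuditRecord]:
    """Load every record back for drift analysis, skipping blank lines."""
    return [AuditRecord(**json.loads(line)) for line in log if line.strip()]
```

Auto-approved and refused items are written too, with `reviewer_decision` left as `None`, since the log has to cover every decision regardless of route.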
07 · What to measure
Six numbers, watched weekly
Six metrics, watched weekly, will tell you whether your HITL system is healthy:
Auto-approval rate. Healthy band depends on the use case, but typically 60–80%. If it's below 60%, the queue is overwhelming reviewers; above 80%, the gate is too lax and quality probably suffers.
Reviewer-disagreement rate between the model's proposal and the reviewer's final decision. Watch the trend, not the absolute number. A drop in disagreement over time without an underlying improvement in model quality is suspicious — it usually means reviewers are rubber-stamping more.
Time-to-decision in the reviewer queue, median and 95th percentile. Time-to-decision rising means the queue is overflowing. Time-to-decision falling dramatically also means trouble — usually that reviewers are skimming.
Sampled-overrule rate. Take 30 random past decisions per week and have a senior reviewer score them blind. Measure how often they disagree with the original decision. This is the closest thing you have to "true" quality and it should be the metric you trust most.
Expert-escalation volume and turnaround. If escalation volume rises sharply, the gate may be miscalibrated. If escalation turnaround is climbing, the experts are saturated and you need to either reduce escalations or add experts.
Reviewer-vs-reviewer agreement measured monthly on a sampled overlap. This is the reviewer-drift indicator.
These six numbers should fit on a single dashboard the team owning the system reviews every week. If they don't, you don't have a HITL system; you have a hope.
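Most of these numbers reduce to a few small calculations over the audit log. The sketch below is illustrative only: the function names are hypothetical, routes are assumed to be plain strings, and the 4-hour median and 8-hour 95th-percentile limits are the reviewer-queue targets quoted earlier. The same `disagreement_rate` helper serves the model-vs-reviewer rate, the sampled-overrule rate, and reviewer-vs-reviewer agreement, depending on which pairs of decisions you pass in.

```python
import random
import statistics


def auto_approval_rate(routes):
    """Fraction of decisions that took the auto-approve route."""
    return sum(route == "auto_approve" for route in routes) / len(routes)


def disagreement_rate(pairs):
    """Fraction of (first_decision, second_decision) pairs that differ."""
    return sum(first != second for first, second in pairs) / len(pairs)


def queue_time_summary(hours_in_queue):
    """Return (median, 95th percentile) of time-in-queue; needs at least two values."""
    median = statistics.median(hours_in_queue)
    p95 = statistics.quantiles(hours_in_queue, n=20)[18]  # 19th of 19 cut points
    return median, p95


def weekly_blind_sample(decision_ids, k=30, seed=None):
    """Pick up to k past decisions at random for blind re-scoring."""
    rng = random.Random(seed)
    return rng.sample(decision_ids, min(k, len(decision_ids)))


def health_warnings(routes, hours_in_queue):
    """Flag the auto-approval band and the reviewer-queue time targets."""
    warnings = []
    rate = auto_approval_rate(routes)
    if rate < 0.60:
        warnings.append(f"auto-approval rate {rate:.0%} is below 60%")
    elif rate > 0.80:
        warnings.append(f"auto-approval rate {rate:.0%} is above 80%")
    median, p95 = queue_time_summary(hours_in_queue)
    if median > 4:
        warnings.append(f"median time-in-queue {median:.1f}h exceeds 4h")
    if p95 > 8:
        warnings.append(f"95th-percentile time-in-queue {p95:.1f}h exceeds 8h")
    return warnings
```

The checks here only cover the metrics with a stated band. The trend-based ones (disagreement drifting down, time-to-decision falling sharply) need week-over-week comparison, which this sketch leaves out.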
08 · What we won't do
Two HITL pitches we always decline
There are two things we will not build under the HITL banner, even when clients ask for them.
We will not build "human-in-the-loop" systems where the human cannot meaningfully override the AI. If the volume is too high, or the SLA too tight, or the reviewer too junior to push back, the human is theatre. We say no to these and explain why. Sometimes the right answer is to slow the system down, sometimes it is to narrow the scope, and sometimes it is not to build the system at all.
And we will not build HITL systems where the reviewer's incentives are aligned against thoughtful review — for example, where the reviewer is measured on throughput rather than on quality, or where pushing back on the AI carries reputational cost within the company. The reviewer's environment determines what they actually do, and "approves three per minute" is what the system will produce if "approves three per minute" is what the company rewards. The HITL pattern only works inside an incentive structure that values the catch over the throughput.
This is the architecture we use, the metrics we measure, and the things we refuse to do. The "human-in-the-loop" phrase is going to keep getting used to launder bad systems through compliance reviews. The phrase is fine. The implementations need to be better. This is what better looks like.
If you have a system in production with a human-in-the-loop story and want a candid audit of whether it's actually working — including the four failure modes and the three drift kinds — we offer fixed-scope HITL audits. We have refused to start three projects in the last year because the audit revealed that the existing oversight was theatre. That's a feature of the audit, not a bug.