How we measure

Every number carries its own definition, window, and sample size.

Rates without denominators are marketing, not measurement. This page is the methodology behind every stat PilotPM publishes — what counts, what doesn't, what window it was measured in, and how we prove an improvement is real.

In a pilot, every number below gets counted live on your own tickets.

Automated resolution · our definition

Resolved means no human ever touched it.

A conversation counts as automatically resolved only when the AI answered it and the customer either confirmed the answer or asked nothing further. If a teammate steps in at any point — even one clarifying line — the conversation leaves the automated column. That's the strict end of the industry's definitions, on purpose.

Confirm-or-silence, over AI-answered

The numerator is conversations the AI actually answered where the customer confirmed or had nothing further to ask — not “the bot said something and the ticket timed out.”

A takeover is a failure

Any human involvement disqualifies the conversation — a reassignment, an internal edit that ships, a follow-up from a rep. No partial credit, full stop.
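
As a rough sketch of how the definition above could be expressed as a counting rule: the field names below are hypothetical and chosen only for illustration, and the real pipeline may track these signals differently.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # All field names are illustrative assumptions, not PilotPM's schema.
    ai_answered: bool           # the AI actually answered the question
    human_touched: bool         # reassignment, shipped internal edit, or rep follow-up
    customer_confirmed: bool    # the customer confirmed the answer
    customer_followed_up: bool  # the customer asked something further afterwards


def is_automated_resolution(c: Conversation) -> bool:
    """Strict rule: AI answered, no human involvement, confirm-or-silence."""
    if not c.ai_answered or c.human_touched:
        return False  # a takeover is a failure, with no partial credit
    return c.customer_confirmed or not c.customer_followed_up


# One clarifying line from a teammate is enough to disqualify.
print(is_automated_resolution(Conversation(True, True, True, False)))  # False
```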

Counted strictly, published anyway

Measured this way our number reads lower than most published rates. We'd rather publish a smaller number that means something than a bigger one that doesn't.

The denominator

A rate never ships without its involvement share.

90% of almost nothing is almost nothing. As an illustration: a system resolving 90% of the 3% of conversations it touches is automating far less than one resolving 55% of the 95% it handles — yet the first one quotes the bigger headline. So every resolution rate we show comes with the share of all inbound the AI was involved in, and neither number can hide behind the other.

Resolution rate — of what the AI handled

Of the conversations the AI was involved in, the share it resolved end-to-end under the definition above. The flattering number, kept honest by its neighbor.

Involvement share — of everything inbound

Of everything that arrived, the share the AI was involved in at all. The number that says whether the automation is load-bearing or a rounding error.
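
To make the arithmetic behind the illustration concrete, here is a minimal sketch that multiplies the two numbers together. The function name is an assumption made for this example; the inputs are the illustrative figures from the text, not measured results.

```python
def share_of_all_inbound(resolution_rate: float, involvement_share: float) -> float:
    """Share of everything inbound that was resolved end-to-end by the AI."""
    return resolution_rate * involvement_share


# The two hypothetical systems from the illustration above.
narrow = share_of_all_inbound(0.90, 0.03)  # 90% of the 3% it touches
broad = share_of_all_inbound(0.55, 0.95)   # 55% of the 95% it handles

print(f"{narrow:.2%} vs {broad:.2%}")  # 2.70% vs 52.25%
```

The bigger headline rate (90%) corresponds to the far smaller share of total inbound, which is why the pair is shown together.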

Our own current pair is pending reference clearance and will be published here once it clears. Until then: in a pilot, both numbers are counted live on your own tickets, with the counting visible.

The category's yardsticks

How the category counts it — as of July 2026.

Published “resolution” definitions differ more than the published rates do. Below are three approaches from vendors' own public documentation — described mechanically, no editorializing. The same support week produces very different rates depending on the yardstick.

Confirmed — or 24 hours of silence

One published definition counts a conversation as resolved when the customer confirms the answer, or simply sends no follow-up for 24 hours after the AI's reply — and a teammate replying in the conversation does not void the assumed resolution. The definition's denominator has also been revised over time, most recently in July 2026 — so a published rate can move without the product changing. Source, as of July 2026 →

72 hours + an LLM check

Another published approach counts a resolution when the requester confirms, or when the conversation sees no further unresolved activity for 72 hours and an LLM verification grades the answer as having addressed the question. Escalation to a human disqualifies the conversation. Source, as of July 2026 →

Any human involvement disqualifies

The strictest published definition counts a conversation only when no human was involved at any point, and additionally grades each interaction as relevant, accurate, and safe using AI evaluation. This is the end of the spectrum our own definition sits on. Source, as of July 2026 →

None of these are “wrong” — they're different instruments. The takeaway is simpler: never accept a resolution rate without its definition. Ask any vendor — including us — what's in the numerator, what's in the denominator, and what a human's involvement does to the count.
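
A small sketch can show how the same week yields different rates under different yardsticks. Everything here is an assumption for illustration: the field names are hypothetical, the five conversations are invented toy data, and each rule models only the conditions quoted above, so real vendor definitions may include clauses that are not captured.

```python
from dataclasses import dataclass


@dataclass
class Convo:
    confirmed: bool         # customer explicitly confirmed the answer
    silent_hours: float     # hours without follow-up after the AI's reply
    human_stepped_in: bool  # a teammate replied or the conversation was escalated
    ai_eval_ok: bool        # an automated check graded the answer as acceptable


def yardstick_24h(c: Convo) -> bool:
    # Confirmation or 24 hours of silence; a teammate reply does not void it.
    return c.confirmed or c.silent_hours >= 24


def yardstick_72h_llm(c: Convo) -> bool:
    # Escalation disqualifies; otherwise confirmation, or 72h quiet plus an LLM check.
    if c.human_stepped_in:
        return False
    return c.confirmed or (c.silent_hours >= 72 and c.ai_eval_ok)


def yardstick_no_human(c: Convo) -> bool:
    # No human at any point, and the AI evaluation must pass.
    return (not c.human_stepped_in) and c.ai_eval_ok


def rate(rule, convos) -> float:
    return sum(rule(c) for c in convos) / len(convos)


# Invented toy week, not real data.
week = [
    Convo(True, 0, False, True),
    Convo(False, 80, False, True),
    Convo(False, 100, True, True),
    Convo(False, 2, False, False),
    Convo(True, 0, False, False),
]

for rule in (yardstick_24h, yardstick_72h_llm, yardstick_no_human):
    print(rule.__name__, rate(rule, week))
```

On this toy week the three rules give 0.8, 0.6, and 0.4 for identical conversations, which is the reason to ask for the definition before comparing rates.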

Draft metrics

Drafted, reviewed, approved, verbatim.

Reply drafting has its own ladder of honesty, and each rung is a different number. All three below are measured on production workspaces over the last 30 days, and each carries its window and sample in the dashboard.

97%

AI-drafted share

The share of outbound replies that began as an AI draft. It says the AI is in the loop — not that it's right. Production, last 30 days.

42%

Acceptance of reviewed drafts

Of the drafts a human actually reviewed, the share that shipped approved. Edits allowed — this measures “good enough to send,” not perfection. Production, last 30 days.

16–20%

Verbatim rate — the hard number

The share sent exactly as the AI wrote it, without a single edit — on email, currently 16–20%. It climbs every week, because the engine mines every edit your team makes.

What counts: a draft is “reviewed” when a human opened and dispositioned it; “approved” when it shipped through the approve action; “verbatim” when the sent text matches the drafted text exactly. Drafts that were superseded or never reviewed sit outside the acceptance denominator — they don't inflate it.
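
The counting rules above might be sketched as follows. The record fields are hypothetical, and the choice to divide the verbatim count by reviewed drafts is an assumption of this example, since the text does not spell out that denominator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    # Field names are illustrative assumptions.
    drafted_text: str
    reviewed: bool            # a human opened and dispositioned it
    approved: bool            # it shipped through the approve action
    sent_text: Optional[str]  # None if nothing was sent from this draft


def draft_metrics(drafts):
    # Superseded or never-reviewed drafts stay outside the denominator.
    reviewed = [d for d in drafts if d.reviewed]
    approved = [d for d in reviewed if d.approved]
    # Verbatim means the sent text matches the drafted text exactly.
    verbatim = [d for d in approved if d.sent_text == d.drafted_text]
    n = len(reviewed)
    return {
        "reviewed": n,
        "acceptance": len(approved) / n if n else None,
        "verbatim": len(verbatim) / n if n else None,  # assumed denominator: reviewed
    }


# Invented toy drafts, not real data.
sample_drafts = [
    Draft("Your refund is on its way.", True, True, "Your refund is on its way."),
    Draft("Try resetting it.", True, True, "Please try resetting your password."),
    Draft("We cannot help.", True, False, None),
    Draft("Old draft.", False, False, None),  # superseded, never reviewed
    Draft("Hello.", True, False, None),
]
print(draft_metrics(sample_drafts))
```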

Effective automation · all inbound

~22% of all inbound, end-to-end.

Across everything that reaches our production workspaces — every channel, measured July 2026 — about 22% is currently handled end-to-end with no human touching it. This is the number behind the ~$0.45 effective cost per resolution on our pricing page.

The honest footnote

Most of that 22% today is machine noise — delivery bounces, receipts, automated notifications — that the system classifies and auto-archives, not customer questions the AI answered. The customer-question share is smaller and growing.

Why we count it — and label it

Auto-archiving noise is real work a human no longer does, so it belongs in an “effective automation” number. But it's labeled: your dashboard splits archived noise from answered customers, so one never dresses up as the other.
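
One way to picture the split is a tally that reports archived noise and answered customers as separate shares alongside the combined figure. The outcome labels below are hypothetical and the ten items are invented toy data, not production numbers.

```python
from collections import Counter


def effective_automation(outcomes):
    """outcomes: one label per inbound item; the labels are assumptions."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no inbound items in the window")
    counts = Counter(outcomes)
    noise = counts["archived_noise"]  # bounces, receipts, notifications
    answered = counts["ai_resolved"]  # customer questions the AI resolved
    return {
        "archived_noise": noise / total,
        "ai_resolved": answered / total,
        "effective": (noise + answered) / total,
    }


# Invented toy inbox: 3 noise, 1 AI-resolved question, 6 handled by humans.
toy_inbox = ["archived_noise"] * 3 + ["ai_resolved"] + ["human"] * 6
print(effective_automation(toy_inbox))
```

Reporting the two components next to the total keeps one from being read as the other.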

Windows & anchoring

Numbers anchored to complete workdays.

A metric that includes the half-day in progress will happily invent a trend by lunchtime. Ours exclude it.

Complete-workday anchoring

Every window ends at the last complete workday — this morning's quiet inbox is not a downturn, and a busy hour is not a spike.
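
A minimal sketch of complete-workday anchoring, assuming a Monday-to-Friday workweek, no holiday calendar, and a workday that counts as complete once its calendar day has ended. Those assumptions and the function names are illustrative; the actual rules may differ.

```python
from datetime import date, datetime, timedelta


def last_complete_workday(now: datetime) -> date:
    """Most recent weekday strictly before today (today is still in progress)."""
    day = now.date() - timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def anchored_window(now: datetime, days: int = 30):
    """Return (start, end) dates for a window that ends on the last complete workday."""
    end = last_complete_workday(now)
    start = end - timedelta(days=days - 1)
    return start, end


# Monday 6 July 2026, 11:00: the window ends on the previous Friday.
print(anchored_window(datetime(2026, 7, 6, 11, 0)))
```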

Per-channel trajectories

Email, chat, and store reviews behave differently, so each channel carries its own trend line — no blended average that hides one channel degrading behind another improving.

Change markers

Every improvement that ships pins a marker to the chart on the day it landed. When a line moves, the chart says why — you never reverse-engineer a trend from memory.

Self-improving, on a 6-hour clock

How the numbers improve — and how we grade the grader.

Every 6 hours, the improvement engine mines the edits your team made to AI drafts, works out what it keeps getting wrong, and proposes the fix — a reply rule, a KB correction, sometimes a code change. Nothing ships on vibes.

Mine the edits

Every human edit to an AI draft is kept as ground truth and diagnosed to a root cause — knowledge, rules, data, or product.

Golden replay

Every proposal re-runs against a golden set of real past conversations. The number moves, or the change doesn't ship.

Canary + tripwire

Changes launch to a 50% canary. Per-channel degradation detection automatically rolls back anything that makes replies worse.

You approve what stays

Every improvement lands behind human approval, and a change marker pins it to the chart on the day it shipped.
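
The canary tripwire step could be sketched as a per-channel comparison between the control group and the canary group. The quality scores, the tolerance value, and the function name are all assumptions for illustration; the text does not specify how degradation is scored.

```python
def channels_to_roll_back(control, canary, tolerance=0.02):
    """Return channels where the canary scores worse than control by more than tolerance.

    control, canary: dicts mapping channel name to a quality score in [0, 1].
    The tolerance is a hypothetical threshold, not a documented setting.
    """
    return [
        channel
        for channel, baseline in control.items()
        if channel in canary and canary[channel] < baseline - tolerance
    ]


# Invented scores: email holds steady, chat degrades under the canary.
control_scores = {"email": 0.80, "chat": 0.75}
canary_scores = {"email": 0.81, "chat": 0.70}
print(channels_to_roll_back(control_scores, canary_scores))  # ['chat']
```

Checking each channel separately means a gain on email cannot mask a regression on chat.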

Grading the grader — because an eval is only as honest as its judge:

Order-swapped, multi-model judging — candidate replies are compared by a panel of models with positions swapped, so no judge can favor a side by where it sits.
Paired confidence intervals — “better” means statistically better on the same conversations, not noise dressed as progress.
Judge-vs-human calibration — judge verdicts are periodically checked against human review, and the judging protocol itself only changes through the same eval gate.
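
The first two checks above can be sketched in a few lines. This is an illustration under stated assumptions, not the production protocol: the judge interface (a callable returning 0 or 1 for the preferred position) is hypothetical, and a percentile bootstrap is just one common way to build a paired confidence interval.

```python
import random
from statistics import mean


def order_swapped_verdict(judges, reply_a, reply_b) -> float:
    """Share of judgments won by reply_a, with each judge asked in both orders.

    Each judge is a callable(first, second) returning 0 if it prefers the
    first reply shown and 1 if it prefers the second (assumed interface).
    """
    wins_a = 0
    total = 0
    for judge in judges:
        wins_a += int(judge(reply_a, reply_b) == 0)  # A shown first
        wins_a += int(judge(reply_b, reply_a) == 1)  # A shown second
        total += 2
    return wins_a / total


def paired_bootstrap_ci(old, new, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-conversation score difference."""
    diffs = [n - o for o, n in zip(old, new)]  # paired on the same conversations
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# A judge that always prefers whatever is shown first ends up at exactly 0.5.
position_biased = lambda first, second: 0
print(order_swapped_verdict([position_biased], "reply A", "reply B"))  # 0.5
```

A gate built on this might require the lower bound of the interval to sit above zero before calling a change better; that threshold is an assumption, not something the text specifies.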

Fair questions

What counts, honestly answered.

What counts as an AI resolution at PilotPM?

A conversation counts as automatically resolved only when the AI answered it and no human ever took over — the customer confirmed the answer or asked nothing further after the AI's reply. A human takeover at any point counts as a failure, full stop. Every resolution rate we show carries this definition, its time window, and its sample size.

Why do you publish the involvement share next to the resolution rate?

Because a resolution rate without its denominator can hide almost anything: resolving 90% of the 3% of conversations an AI touches is less automation than resolving 55% of the 95% it handles. We always show what share of all inbound the AI was involved in alongside how much of that it resolved, so neither number can hide behind the other.

Why does your automation number look lower than other published rates?

Mostly definitions. Some published definitions count 24 hours of customer silence as a resolution, or keep a conversation in the resolved column even when a teammate replied. Ours doesn't: any human involvement disqualifies the conversation. Measured on the strict end of the industry's yardsticks, the number is smaller — and it means something. Ask any vendor, including us, for the definition behind the number.