Pilots: What to Measure & How
The purpose of a pilot is to make a decision: should we scale up this AI implementation or go back to the drawing board? Data can help you make this decision.
Setting up and trying out a measurement plan now also gives you a foundation for the future. Once you scale up, you will want to have a tested measurement plan and a baseline to which you can compare future growth, adjustment, or alternatives.
But what should you measure?
Everything you measure should serve that scale-or-stop decision.
1. North-star mission outcomes
(Lagging indicators; slow; non-negotiable)
These are the outcomes your organization exists to produce in the world. They usually change slowly and cannot be fully evaluated during a short pilot, but still anchor the measurement plan.
Examples:
Housing stability at 12 months
Employment retention six months after placement
Health outcomes, not just service utilization
Educational attainment, not attendance alone
Why they matter:
Mission outcomes protect you from optimizing the wrong thing. A system that speeds up intakes but worsens long-term housing stability is not a success.
In a pilot, you usually cannot yet observe change in these outcomes. That’s expected. The goal is not to measure improvement now, but to set yourself up to track these outcomes over the long term.
For now, think of these as the destination on the map. You don’t judge progress by whether you’ve arrived, but by whether you’re still headed the right way.
2. Program and stakeholder outcomes
(Near-term; actionable; decision-shaping)
These outcomes sit between operations and mission. They respond faster and help you interpret whether observed efficiency gains actually translate into better service.
Examples:
Completion rates for applications or referrals
Time from first contact to service delivery
Client-reported clarity, trust, or perceived respect
Staff confidence in decisions supported by the system
Why they matter:
These measures explain how operational changes may eventually affect mission outcomes. Understanding as much as you can about the causes of success matters if you later need to pivot, want to expand, or try to replicate your program in another context. These indicators are also where many AI-related failures first appear.
These indicators give you levers. You can adjust workflows, prompts, thresholds, or training while the pilot is still cheap to change.
3. Operational outputs
(Fast; concrete; necessary but insufficient)
These are the easiest metrics to collect, and the most tempting to overvalue.
Examples:
Time per intake
Cases processed per week
Staff hours saved or reallocated
Backlog size
Why they matter:
Operational outputs tell you whether the system does what it claims to do. If AI does not reduce time, effort, or bottlenecks, there is little reason to proceed.
Why they are not enough:
Efficiency gains can mask harm. A system that processes applications faster but systematically misroutes non-English speakers has excellent operational metrics and terrible mission alignment.
Operational outputs answer: Is the machine working?
They do not answer: Is this good?
4. Guardrails
(Always on; threshold-based; scale-stopping)
Guardrails define what you will not accept, even if everything else looks positive.
Typical guardrail domains:
Equity: outcome parity across language groups, race, disability status, or geography
Quality: error rates, reversals, or downstream corrections
Workforce well-being: burnout, deskilling, moral distress, loss of professional judgment
Trust and safety: complaints, opt-outs, adverse incidents
Why they matter:
Guardrails protect against the most common failure mode in AI pilots: incremental harm hidden by average improvement. Guardrails force you to look at distributional effects and edge cases, where mission-driven organizations are most vulnerable.
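The equity guardrail above can be made concrete with a simple parity check. This is a minimal sketch with invented group names and rates; the tolerance value is an assumption you would set in advance for your own context, not a recommendation.

```python
# Hypothetical completion rates by language group -- illustrative numbers only.
completion_rates = {
    "English": 0.82,
    "Spanish": 0.79,
    "Vietnamese": 0.74,
}

# Maximum acceptable gap between the best- and worst-served group,
# decided before the pilot starts (assumed value).
PARITY_TOLERANCE = 0.10

# Distributional check: averages can look fine while one group falls behind.
gap = max(completion_rates.values()) - min(completion_rates.values())
within_parity = gap <= PARITY_TOLERANCE
```

The point of the check is that it looks at the spread across groups, not the overall average, which is exactly where incremental harm hides.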
Decide thresholds in advance
Before the pilot begins, write down what counts as acceptable change and what does not.
This matters because once staff adapt, leaders grow invested, and sunk costs appear, it becomes much harder to pull back or stop.
Examples:
Error rates must remain below 5%
Client satisfaction must not decline relative to baseline
Outcome parity across language groups must remain within a predefined range
Staff must retain final decision authority in high-risk cases
Exceeding a threshold doesn’t have to mean abandoning the entire pilot, but it is a signal that you need to stop and reevaluate.
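One way to keep pre-committed thresholds honest is to write them down as code as well as prose. A minimal sketch, assuming hypothetical metric names and a simple metrics snapshot; the thresholds mirror the examples above but are otherwise illustrative.

```python
# Guardrails written down in advance, each as a named pass/fail rule.
# Metric names and values are assumptions for illustration.
guardrails = {
    "error_rate":      lambda m: m["error_rate"] < 0.05,
    "satisfaction":    lambda m: m["satisfaction"] >= m["baseline_satisfaction"],
    "human_final_say": lambda m: m["staff_final_decision_high_risk"],
}

def breached(metrics):
    """Return the names of guardrails the current metrics violate."""
    return [name for name, ok in guardrails.items() if not ok(metrics)]

# A mid-pilot snapshot (made-up numbers).
snapshot = {
    "error_rate": 0.03,
    "satisfaction": 4.2,
    "baseline_satisfaction": 4.1,
    "staff_final_decision_high_risk": True,
}
```

Because the rules are fixed before anyone is invested in the result, a breach is a fact to respond to rather than a number to renegotiate.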
A concrete example
A housing assistance pilot introduces AI-supported intake triage.
Observed during the pilot:
Average intake time falls from 4.6 days to 2.9 days
Application completion rates increase
Client trust surveys remain stable
Spanish-speaking clients see improvements comparable to English-speaking clients
Twelve-month housing outcomes are not yet observable, but the infrastructure to track them is in place.
The pilot passes guardrails, shows plausible links between efficiency and access, and preserves equity. Leadership proceeds to scale with continued monitoring.
Lock a baseline from the beginning
Collect 8–12 weeks of pre-pilot data on every metric you plan to use.
Without a baseline, a statement like “10 intakes per week with AI” tells you almost nothing. How many were we doing before?
Baselines also help surface variation that would otherwise be misattributed to AI, like seasonality, staff turnover effects, or policy changes.
If baseline data is unavailable, say so explicitly and treat early findings as exploratory, not decisive.
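A baseline comparison can be as simple as a few lines. This sketch uses invented weekly intake counts; the useful part is that the baseline’s own week-to-week variation tells you how large a pilot change must be before it is plausibly an AI effect rather than noise.

```python
from statistics import mean, stdev

# Hypothetical weekly intake counts: 8 weeks pre-pilot, 4 weeks during.
baseline_weeks = [7, 9, 8, 10, 8, 9, 7, 10]
pilot_weeks = [11, 12, 10, 13]

baseline_mean = mean(baseline_weeks)
pilot_mean = mean(pilot_weeks)
change = pilot_mean - baseline_mean

# Normal week-to-week variation before the pilot; a pilot reading inside
# this band could easily be seasonality, staffing, or chance.
normal_variation = stdev(baseline_weeks)
```

If the observed change is comfortably larger than the baseline’s normal variation, the AI effect is at least plausible; if not, treat the finding as exploratory.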

