Pilots: What to Measure & How

The purpose of a pilot is to make a decision: should we scale up this AI implementation or go back to the drawing board? Data can help you make this decision. 

Setting up and trying out a measurement plan now also gives you a foundation for the future. Once you scale up, you will want to have a tested measurement plan and a baseline to which you can compare future growth, adjustment, or alternatives.

But what should you measure?  

A pilot answers one question: do we scale this, or stop?
Everything about measurement should serve that decision.

The measures below fall into four categories: mission outcomes, program and stakeholder outcomes, operational outputs, and the guardrails that define what you will not accept.

1. North-star mission outcomes

(Lagging indicators; slow; non-negotiable)

These are the outcomes your organization exists to produce in the world. They usually change slowly and cannot be fully evaluated during a short pilot, but still anchor the measurement plan.

Examples:

  • Housing stability at 12 months

  • Employment retention six months after placement

  • Health outcomes, not just service utilization

  • Educational attainment, not attendance alone

Why they matter:
Mission outcomes protect you from optimizing the wrong thing. A system that speeds up intakes but worsens long-term housing stability is not a success.

In a pilot, you usually cannot yet observe change in these outcomes. That’s expected. The goal is not to measure improvement now, but to set yourself up to track these outcomes over the long term.

For now, think of these as the destination on the map. You don’t judge progress by whether you’ve arrived, but by whether you’re still headed the right way.

2. Program and stakeholder outcomes

(Near-term; actionable; decision-shaping)

These outcomes sit between operations and mission. They respond faster and help you interpret whether observed efficiency gains actually translate into better service.

Examples:

  • Completion rates for applications or referrals

  • Time from first contact to service delivery

  • Client-reported clarity, trust, or perceived respect

  • Staff confidence in decisions supported by the system

Why they matter:
These measures explain how operational changes may eventually affect mission outcomes. Understanding as much as you can about what drives success will matter if you need to pivot, expand, or replicate the program in another context. They are also where many AI-related failures first appear.

These indicators give you levers. You can adjust workflows, prompts, thresholds, or training while the pilot is still cheap to change.

3. Operational outputs

(Fast; concrete; necessary but insufficient)

These are the easiest metrics to collect, and the most tempting to overvalue.

Examples:

  • Time per intake

  • Cases processed per week

  • Staff hours saved or reallocated

  • Backlog size

Why they matter:
Operational outputs tell you whether the system does what it claims to do. If AI does not reduce time, effort, or bottlenecks, there is little reason to proceed.

Why they are not enough:
Efficiency gains can mask harm. A system that processes applications faster but systematically misroutes non-English speakers has excellent operational metrics and terrible mission alignment.

Operational outputs answer: Is the machine working?
They do not answer: Is this good?

4. Guardrails

(Always on; threshold-based; scale-stopping)

Guardrails define what you will not accept, even if everything else looks positive.

Typical guardrail domains:

  • Equity: outcome parity across language groups, race, disability status, or geography

  • Quality: error rates, reversals, or downstream corrections

  • Workforce well-being: burnout, deskilling, moral distress, loss of professional judgment

  • Trust and safety: complaints, opt-outs, adverse incidents

Why they matter:
Guardrails protect against the most common failure mode in AI pilots: incremental harm hidden by average improvement. Guardrails force you to look at distributional effects and edge cases, where mission-driven organizations are most vulnerable.
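
A minimal sketch of what looking at distributional effects can mean in practice, assuming pilot records live in a simple list of dictionaries; the field names (language, completed) and values here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical intake records; in practice these would come from your case
# management system or intake spreadsheet. Field names are illustrative.
records = [
    {"language": "English", "completed": True},
    {"language": "English", "completed": True},
    {"language": "English", "completed": False},
    {"language": "Spanish", "completed": True},
    {"language": "Spanish", "completed": False},
]

# Tally completions per language group instead of one overall average.
groups = defaultdict(lambda: {"completed": 0, "total": 0})
for record in records:
    tally = groups[record["language"]]
    tally["total"] += 1
    tally["completed"] += int(record["completed"])

for language, tally in groups.items():
    rate = tally["completed"] / tally["total"]
    print(f"{language}: {rate:.0%} completion rate ({tally['total']} cases)")
```

The point is the breakdown itself: a single overall completion rate would hide a gap between the two groups.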

Decide thresholds in advance

Before the pilot begins, write down what counts as acceptable change and what does not.

This matters because once staff adapt, leaders grow invested, and sunk costs accumulate, it becomes much harder to pull back or stop.

Examples:

  • Error rates must remain below 5%

  • Client satisfaction must not decline relative to baseline

  • Outcome parity across language groups must remain within a predefined range

  • Staff must retain final decision authority in high-risk cases

Exceeding a threshold doesn’t have to mean scrapping the entire pilot, but it is a signal that you need to reevaluate.
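
As a sketch of what writing thresholds down in advance can look like, the rules below are encoded as data that a simple check runs against each reporting period. The metric names and limits are illustrative assumptions, not recommendations:

```python
# Illustrative thresholds, agreed on before the pilot starts.
GUARDRAILS = {
    "error_rate":            ("max", 0.05),  # must stay below 5%
    "client_satisfaction":   ("min", 4.1),   # must not fall below the baseline average
    "completion_parity_gap": ("max", 0.03),  # gap between language groups
}

def check_guardrails(observed):
    """Return a list of guardrail breaches for one reporting period."""
    breaches = []
    for metric, (kind, limit) in GUARDRAILS.items():
        value = observed.get(metric)
        if value is None:
            breaches.append(f"{metric}: no data collected")  # missing data is itself a flag
        elif kind == "max" and value > limit:
            breaches.append(f"{metric}: {value} exceeds limit {limit}")
        elif kind == "min" and value < limit:
            breaches.append(f"{metric}: {value} is below floor {limit}")
    return breaches

# One reporting period from the pilot (made-up numbers).
print(check_guardrails({
    "error_rate": 0.07,
    "client_satisfaction": 4.3,
    "completion_parity_gap": 0.02,
}))
# -> ['error_rate: 0.07 exceeds limit 0.05']
```

A breach does not automatically end the pilot; it triggers the reevaluation described above.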

A concrete example

A housing assistance pilot introduces AI-supported intake triage.

Observed during the pilot:

  • Average intake time falls from 4.6 days to 2.9 days

  • Application completion rates increase

  • Client trust surveys remain stable

  • Spanish-speaking clients see improvements comparable to English-speaking clients

Twelve-month housing outcomes are not yet observable, but the infrastructure to track them is in place.

The pilot passes guardrails, shows plausible links between efficiency and access, and preserves equity. Leadership proceeds to scale with continued monitoring. 

Lock a baseline from the beginning

Collect 8–12 weeks of pre-pilot data on every metric you plan to use.

Without a baseline, statements like “10 intakes per week with AI” are not that helpful. How many were we doing before? 

Baselines also help surface variation that would otherwise be misattributed to AI, like seasonality, staff turnover effects, or policy changes. 

If baseline data is unavailable, say so explicitly and treat early findings as exploratory, not decisive.
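
A minimal sketch of the comparison a baseline makes possible, using made-up weekly intake counts for the pre-pilot and pilot periods:

```python
from statistics import mean

# Made-up weekly intake counts: ten pre-pilot weeks, then six pilot weeks.
baseline_weeks = [7, 8, 6, 9, 7, 8, 7, 6, 8, 7]
pilot_weeks = [10, 9, 11, 10, 12, 10]

baseline_avg = mean(baseline_weeks)
pilot_avg = mean(pilot_weeks)
change = (pilot_avg - baseline_avg) / baseline_avg

print(f"Pre-pilot: {baseline_avg:.1f} intakes/week")
print(f"Pilot:     {pilot_avg:.1f} intakes/week ({change:+.0%} vs. baseline)")
```

Without the first list, the second is just a number.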
