AI coding agents need KPIs: how to measure speed, quality, reliability, and cost

A quick chat question using GPT-5 mini costs roughly $0.0015. A multi-file refactor using GPT-5.5 costs around $5.50. Same developer. Same day. More than 3,600x cost difference. (Both numbers come from the published per-token rates worked through in the previous post.)

That was the main point of that post: agentic coding breaks flat-rate mental models. Under usage-based billing, the same team can look cheap on seat price and expensive on actual usage.

This post is the follow-up: if your bill is now variable, what should you measure?

Not just cost. Cost alone will make teams defensive. Productivity alone will make finance nervous. You need a scorecard that puts speed, quality, reliability, and spend in the same conversation.

Here is the practical version I would start with.

Short version: do not measure AI coding agents only by seat price, lines of code, or developer sentiment. Measure the full loop: delivery speed, review quality, production safety, and AI credits consumed.

What the cost post already established

The pricing mechanics matter because they define what you can measure.

From GitHub’s usage-based billing docs and announcement:

1 GitHub AI Credit = $0.01 USD
Copilot Business includes 1,900 credits/user/month, pooled at the billing entity level
Copilot Enterprise includes 3,900 credits/user/month, pooled at the billing entity level
existing Business customers get 3,000 credits/user/month during the promotional period (June 1 – September 1, 2026)
existing Enterprise customers get 7,000 credits/user/month during the same window
code completions and Next Edit Suggestions remain included on paid plans and do not consume AI Credits
chat, agent sessions, Copilot CLI, cloud agent, Spaces, Spark, and third-party coding agents consume AI Credits
Copilot code review also consumes GitHub Actions minutes on GitHub-hosted runners
usage reports expose aic_quantity and aic_gross_amount so admins can estimate spend under the new model

That means the useful unit is no longer “Copilot seats”. It is AI credits per useful engineering outcome.

If your team uses Claude Code, Cursor, or direct API access instead of Copilot, the credit language does not apply, but the structure does. Substitute tokens consumed or dollars billed wherever I write “credits” below. The KPI logic is the same.

The previous post used this 10-person Copilot Business example:

Team type	Daily usage pattern	Monthly credit burn	Overage
Light	A few chats, rare agent use	~8,000 credits	None
Moderate	About $12.50/day across the team	~25,000 credits	~$60
Heavy	About $25/day across the team	~50,000 credits	~$310

Those are illustrative scenarios, not GitHub-published averages. But they are useful because they show the shape of the problem: the sticker price for 10 Business seats is $190/month, while a heavy team can behave more like a $500/month team once overage is included.

That is still cheap if quality and speed improve. It is expensive if the team is burning frontier models on trivial work and creating rework.

That is why you need KPIs.
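To make the arithmetic behind the 10-person example explicit, here is a minimal sketch in Python. It assumes the post-promotional Copilot Business allowance (1,900 credits per seat), the $19 sticker price, and that overage is simply credits above the pooled allowance billed at $0.01 each. The function name `monthly_cost` is mine, not something GitHub ships.

```python
CREDIT_USD = 0.01          # 1 GitHub AI Credit = $0.01
INCLUDED_PER_SEAT = 1_900  # Copilot Business, after the promotional period
SEAT_PRICE_USD = 19.0      # Business sticker price per seat per month


def monthly_cost(seats: int, credits_burned: int) -> dict:
    """Pool size, overage, and total monthly cost for one billing entity."""
    pool = seats * INCLUDED_PER_SEAT          # credits are pooled, not per user
    overage_credits = max(0, credits_burned - pool)
    overage_usd = overage_credits * CREDIT_USD
    return {
        "pool": pool,
        "overage_credits": overage_credits,
        "overage_usd": round(overage_usd, 2),
        "total_usd": round(seats * SEAT_PRICE_USD + overage_usd, 2),
    }


# The three illustrative scenarios from the table above.
for label, burn in [("Light", 8_000), ("Moderate", 25_000), ("Heavy", 50_000)]:
    print(label, monthly_cost(seats=10, credits_burned=burn))
```

Run against the table, this reproduces the $60 and $310 overage figures and the roughly $500/month total for the heavy team.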

The measurement gap

Most teams still evaluate AI coding with weak proxies:

lines of code written
number of prompts
monthly spend without usage context
vague “dev happiness” signals

These are easy to collect and hard to use.

If you want to know whether AI is helping engineering, you need four buckets:

Speed: are we shipping faster?
Quality: are we creating less rework?
Reliability: is delivery more predictable?
Cost: is AI spend buying useful outcomes?

The KPI model

Start small. A dozen or so metrics is enough. Much more than that and the dashboard becomes a hobby.

Speed metrics

KPI	Why it matters	How to calculate
Lead time for change	End-to-end delivery speed	PR open → production deploy
Time to first reviewed PR	Speed of first delivery on new work	Ticket start → first PR opened for review
Review cycle count	Detects churn	Average review rounds per PR
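The three speed calculations above can be sketched as follows. The `PullRequest` record is a hypothetical shape, not a GitHub API object, and I use the median for the two durations because a single stuck PR distorts an average; that choice is an assumption, not part of the definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class PullRequest:  # hypothetical record assembled from your own PR events
    ticket_started: datetime
    opened: datetime
    deployed: datetime
    review_rounds: int


def days(delta: timedelta) -> float:
    return delta.total_seconds() / 86_400


def speed_metrics(prs: list[PullRequest]) -> dict:
    return {
        # PR open -> production deploy
        "lead_time_days": median(days(p.deployed - p.opened) for p in prs),
        # Ticket start -> first PR opened for review
        "time_to_first_pr_days": median(days(p.opened - p.ticket_started) for p in prs),
        # Average review rounds per PR
        "review_cycle_count": mean(p.review_rounds for p in prs),
    }
```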

Quality metrics

KPI	Why it matters	How to calculate
Rework rate	Captures avoidable re-edits	% PRs with >2 major revision rounds
Escaped defects	Tracks production quality	Bugs filed within 7 days of release
Test coverage delta	Prevents “fast but fragile” output	Coverage change per merged PR
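A minimal sketch of the first two quality calculations, assuming you already have revision-round counts per PR and bug filing dates from your tracker. Treating "within 7 days" as inclusive of day 7 is my assumption; freeze whichever reading you pick in writing.

```python
from datetime import date, timedelta


def rework_rate(revision_rounds: list[int], threshold: int = 2) -> float:
    """Percent of PRs with more than `threshold` major revision rounds."""
    if not revision_rounds:
        return 0.0
    reworked = sum(1 for rounds in revision_rounds if rounds > threshold)
    return 100 * reworked / len(revision_rounds)


def escaped_defects(release_day: date, bug_filed_days: list[date],
                    window_days: int = 7) -> int:
    """Bugs filed from release day through release day + window_days."""
    cutoff = release_day + timedelta(days=window_days)
    return sum(1 for filed in bug_filed_days if release_day <= filed <= cutoff)
```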

Reliability metrics

KPI	Why it matters	How to calculate
Change failure rate	Core delivery health signal	% releases causing incident/rollback
Rollback frequency	Flags risky merges	Rollbacks per sprint
First-run CI pass rate	Measures release stability	% pipelines passing without rerun
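The two percentage metrics in this table can be sketched like this. `Release` and `PipelineRun` are hypothetical records you would build from your CI/CD events; the field names are mine.

```python
from dataclasses import dataclass


@dataclass
class Release:  # hypothetical: one row per production release
    caused_incident: bool
    rolled_back: bool


@dataclass
class PipelineRun:  # hypothetical: one row per CI run
    passed: bool
    attempt: int  # 1 = first run, higher = rerun


def pct(part: int, whole: int) -> float:
    return 100 * part / whole if whole else 0.0


def change_failure_rate(releases: list[Release]) -> float:
    """Percent of releases that caused an incident or a rollback."""
    failed = sum(1 for r in releases if r.caused_incident or r.rolled_back)
    return pct(failed, len(releases))


def first_run_pass_rate(runs: list[PipelineRun]) -> float:
    """Percent of pipelines that passed on the first attempt, ignoring reruns."""
    first_attempts = [r for r in runs if r.attempt == 1]
    return pct(sum(1 for r in first_attempts if r.passed), len(first_attempts))
```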

Cost metrics

KPI	Why it matters	How to calculate
AI credits per merged PR	Normalizes usage to output	`aic_quantity` / merged PRs
Gross AI cost per merged PR	Converts usage to money	`aic_gross_amount` / merged PRs
Overage exposure	Shows budget risk	Gross usage - included credit pool
Heavy-user share	Finds expensive patterns	% spend from top 10% of users

Notice the wording: AI credits per merged PR, not just cost per PR. Dollars are useful for finance. Credits are useful for engineering operations because they map directly to model choice, context size, and agent behavior.
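Here is a rough sketch of the four cost metrics computed from usage rows. The `aic_quantity` and `aic_gross_amount` fields come from the usage report; the `user` key and the row-as-dict shape are my assumptions about how you would load it, and I assume `aic_gross_amount` is in USD.

```python
def cost_metrics(usage_rows: list[dict], merged_prs: int,
                 included_credits: int) -> dict:
    """usage_rows: dicts with 'user' (assumed), 'aic_quantity', 'aic_gross_amount'."""
    credits = sum(r["aic_quantity"] for r in usage_rows)
    gross = sum(r["aic_gross_amount"] for r in usage_rows)

    spend_by_user: dict[str, float] = {}
    for r in usage_rows:
        spend_by_user[r["user"]] = spend_by_user.get(r["user"], 0.0) + r["aic_gross_amount"]
    ranked = sorted(spend_by_user.values(), reverse=True)
    top_n = max(1, (len(ranked) + 9) // 10)  # top 10% of users, at least one

    return {
        "credits_per_merged_pr": credits / merged_prs if merged_prs else None,
        "gross_cost_per_merged_pr": gross / merged_prs if merged_prs else None,
        "overage_credits": max(0, credits - included_credits),
        "heavy_user_share_pct": 100 * sum(ranked[:top_n]) / gross if gross else 0.0,
    }
```

Keep the output at team level. The per-user grouping exists only to compute spend concentration, not to rank people.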

One metric in the speed table misleads more than the others: review cycle count. When it falls, the easy read is that code quality improved. The more common explanation is that reviewers started rubber-stamping AI-generated PRs because the volume went up and the detail became harder to challenge. Track it alongside escaped defects, not on its own.

A scorecard you can actually run

You do not need a giant BI project. A weekly one-page scorecard is enough.

Here is an illustrative scorecard for a 10-person Copilot Business team in the first month after the promotional period ends (so 1,900 × 10 = 19,000 included credits/month), with usage projected from the first two weeks of the month.

KPI	Baseline	Current	Trend	Status
Lead time for change	4.8 days	3.9 days	↓ 19%	🟢
Review cycle count	2.6	2.1	↓ 19%	🟢
Escaped defects / sprint	6	7	↑ 17%	🟡
First-run CI pass rate	71%	79%	↑ 8 pts	🟢
Monthly AI credits	19,000 included	25,000 projected	+6,000 over	ℹ️
Projected overage	$0	~$60	New cost line	ℹ️
Heavy-user share	not yet measured	~40% from top 10% of users	first reading	ℹ️

Use your own baseline and thresholds. The table above is an example layout, not a benchmark claim.
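The Trend column is plain relative change against the baseline. A small sketch, with a hypothetical traffic-light rule added to show where your own thresholds would plug in (the 25% default is a placeholder, not a recommendation):

```python
def trend(baseline: float, current: float) -> str:
    """Relative change versus baseline, formatted like the Trend column."""
    change = (current - baseline) / baseline * 100
    arrow = "↓" if change < 0 else "↑" if change > 0 else "→"
    return f"{arrow} {abs(change):.0f}%"


def status(baseline: float, current: float, lower_is_better: bool,
           warn_pct: float = 25.0) -> str:
    """Hypothetical rule: green if not worse, yellow if worse by less
    than warn_pct percent, red otherwise."""
    change = (current - baseline) / baseline * 100
    worse_by = change if lower_is_better else -change
    if worse_by <= 0:
        return "green"
    return "yellow" if worse_by < warn_pct else "red"


print(trend(4.8, 3.9), status(4.8, 3.9, lower_is_better=True))  # lead time
print(trend(6, 7), status(6, 7, lower_is_better=True))          # escaped defects
```

For metrics that are already percentages, such as first-run CI pass rate, report the difference in points (79 - 71 = 8 pts) rather than a relative change.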

The point is not that $60 overage is bad. It probably is not. The point is that you can now ask the right question: did the extra 6,000 credits buy faster delivery, less rework, or better reliability?

This also keeps the post aligned with the earlier budget recommendation: teams that take agent mode seriously should not plan around the $19 Business sticker price alone. A more honest planning envelope is still $40-$60 per engineer per month, then refine from real usage.

30-60-90 rollout plan

%%{init: {"gantt": {"barHeight": 26, "barGap": 6}}}%%
gantt
    title KPI rollout for AI coding agents
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Baseline (Day 1-30)
  Define KPIs                    :2026-06-03, 10d
  Collect baseline               :2026-06-11, 20d
  PR metadata                    :2026-06-15, 25d

    section Controlled Usage (Day 31-60)
  Model policy                   :2026-07-03, 12d
  Review guardrails              :2026-07-09, 16d
  Weekly scorecard               :2026-07-13, 20d

    section Optimize and Scale (Day 61-90)
  Compare patterns               :2026-08-02, 15d
  Cut spend                      :2026-08-10, 12d
  Go/no-go                       :2026-08-22, 10d

Days 1-30: baseline and instrumentation

Select one pilot team (5-12 engineers)
Freeze KPI definitions in writing
Capture pre-AI baseline where possible
Download the Copilot usage report if you are already using Copilot Business or Enterprise
Add PR tags:
- ai-assisted: yes/no
- workflow: single-agent | tools | multi-agent
- model-tier: low | mid | frontier
Track aic_quantity and aic_gross_amount from the usage report

Attribution is hard, and self-reporting is unreliable. Developers forget the tag, disagree on what counts as “AI-assisted”, or game it once they realize it is being measured. Treat PR tags as a weak hint, not ground truth. The stronger signals are:

Copilot usage metrics at the organization level, joined to team membership where needed and correlated with PR throughput over the same window
commits authored by copilot[bot] or your cloud agent’s service account
pull request creators on agent-initiated PRs (Copilot cloud agent, Devin, Codex)

Use PR tags to enrich the picture, not to drive the cost-per-PR calculation.
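As a sketch of the author-based signal, you can classify PRs by the login that opened them instead of trusting tags. The `copilot[bot]` login is the one mentioned above; anything else in `AGENT_LOGINS` is for you to fill in, and the function names are hypothetical.

```python
from collections import Counter

# Extend with your cloud agent's service account or other agent logins.
AGENT_LOGINS = frozenset({"copilot[bot]"})


def initiator(author_login: str, agent_logins: frozenset = AGENT_LOGINS) -> str:
    """Classify a PR as agent- or human-initiated from its author login."""
    return "agent" if author_login.lower() in agent_logins else "human"


def count_by_initiator(author_logins: list[str]) -> Counter:
    return Counter(initiator(login) for login in author_logins)
```

This only tells you who opened the PR. A human-opened PR can still be mostly agent-written, which is why the usage metrics remain the primary signal.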

No baseline? Don’t fake one

Most teams asking these questions have never tracked DORA, let alone an AI-specific scorecard. That is fine. Three workable ways to get a baseline without rewriting history:

Pre-period freeze (4 weeks). Turn instrumentation on, change nothing else for four weeks, then start rollout. The pre-period becomes your baseline.
Pilot vs control. Roll out to one team, hold a comparable team flat for one quarter. Compare deltas, not absolutes.
Phased ramp. Stage the rollout (chat → agent mode → cloud agent) and treat each phase boundary as a checkpoint.

What does not work: claiming a baseline from gut feel, or comparing this quarter to “last year, roughly”. If you cannot defend the baseline in a 5-minute conversation, throw it out and use a pilot vs control instead.
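"Compare deltas, not absolutes" in the pilot vs control option amounts to a simple difference-in-differences. A minimal sketch, with hypothetical function names and made-up lead-time numbers:

```python
def delta_pct(before: float, after: float) -> float:
    return (after - before) / before * 100


def pilot_vs_control(pilot_before: float, pilot_after: float,
                     control_before: float, control_after: float) -> dict:
    """Net effect = pilot change minus whatever changed for everyone anyway."""
    pilot = delta_pct(pilot_before, pilot_after)
    control = delta_pct(control_before, control_after)
    return {
        "pilot_delta_pct": pilot,
        "control_delta_pct": control,
        "net_effect_pts": pilot - control,
    }


# Illustrative only: lead time in days for both teams, before and after.
result = pilot_vs_control(5.0, 4.0, 5.0, 4.75)
```

In this made-up case the pilot team improved 20% and the control team 5%, so roughly 15 points of the improvement can be attributed to the rollout rather than to the quarter.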

Days 31-60: controlled usage patterns

Set a default model policy (cheap by default, frontier by exception)
Define approved high-cost cases (hard debugging, architecture spikes, incident response)
Decide whether budgets should alert only or stop usage when exhausted
Add lightweight checks for AI-heavy PRs:
- tests included
- security-sensitive changes flagged
- spec/task link present for non-trivial changes

Platform-level enforcement: as of May 26, 2026, GitHub added targeted model rules for enterprise owners. You can now configure which Copilot models are available per organization, and set each model’s availability to Enabled (automatically on for all orgs) or Optional (each org decides whether to enable it). This moves model policy from team convention into platform configuration, so “cheap by default, frontier by exception” becomes enforceable rather than advisory. Available to Copilot Business and Copilot Enterprise customers.

Days 61-90: optimize and decide

Compare high-performing and low-performing PR patterns
Promote what works into team conventions
Remove expensive patterns with no quality lift
Tune budgets by team/workflow instead of using one blunt cap for everyone
Decide scaling path: expand, hold, or redesign

Where each metric comes from

Source	Gives you
Git metadata + PR events	Lead time, review rounds, merge velocity
CI/CD system	First-run pass rate, failure rate, rollback signals
Issue tracker	Escaped defects, rework linkage
Copilot usage report / billing dashboard	`aic_quantity`, `aic_gross_amount`, model mix, spend concentration
Spec and task artifacts	Plan adherence, scope control, handoff quality

If you are running spec-driven workflows, add one extra KPI:

Spec adherence score = % of implemented tasks traceable to approved spec/task artifacts.

That catches silent scope creep early.

I would treat this as an internal quality signal, not an industry benchmark. There is no universal number here. A team doing small maintenance work will look different from a team shipping a new product slice with Spec-Kit or OpenSpec.
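The spec adherence score is easy to compute once tasks carry a link to their spec. A sketch, assuming each task is a dict with hypothetical `implemented` and `spec_id` keys:

```python
def spec_adherence(tasks: list[dict], approved_specs: set[str]) -> float:
    """Percent of implemented tasks traceable to an approved spec/task artifact."""
    implemented = [t for t in tasks if t["implemented"]]
    if not implemented:
        return 0.0
    traceable = sum(1 for t in implemented if t.get("spec_id") in approved_specs)
    return 100 * traceable / len(implemented)
```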

How GitHub itself frames this: ESSP, DORA, and the Copilot dashboards

You do not have to invent this framework from scratch. GitHub already publishes one and ships dashboards to power it.

The Engineering System Success Playbook (ESSP)

GitHub’s Engineering System Success Playbook is their answer to “how should we measure an engineering org in the AI era?” It is built around four zones working together, not in isolation:

Quality: defects, change failure rate, security
Velocity: throughput, lead time, deployment frequency
Developer happiness: friction, satisfaction, retention signals
Business outcomes: how engineering work ties back to revenue, cost, and customer impact

ESSP itself lists 12 metrics in total, three per zone. That is roughly the size of the scorecard above, and not by accident.

GitHub explicitly positions ESSP as the framework they used internally to roll out Copilot at scale. The big idea: optimizing one zone in isolation produces bad outcomes. Speed without quality creates debt. Quality without velocity creates stagnation. Happiness without either is a comfort trap. Business outcomes without the other three is a number on a slide that nobody on the team recognises.

Map that onto the scorecard above:

ESSP zone	KPIs from this post
Quality	First-run CI pass rate, escaped defects, rollback frequency
Velocity	Lead time for change, PRs/week, time to first reviewed PR
Developer happiness	Quarterly DX pulse survey (see below)
Business outcomes	AI credits per merged PR, gross AI cost per merged PR, overage exposure

Spend that buys velocity at the expense of quality is not a win. Spend that buys nothing measurable in any of the other three zones is not a win either.

Developer happiness is the one zone you cannot get from logs. The honest measurement is a short pulse survey, not heavy-user share or model mix. A workable starter, borrowed from the SPACE framework, is four questions on a 1–5 scale every quarter:

How satisfied are you with your AI tooling this quarter?
How often does it block or slow you down?
How much trust do you have in its output for production code?
Would you recommend the current setup to a peer team?

Track the trend, not the absolute score. A sustained drop in question 3 is a leading indicator of rework before it shows up in defect data.
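Tracking the trend can be as simple as a per-question mean per quarter plus a check for consecutive drops. In this sketch, "sustained" means the score fell in each of the last two quarter-over-quarter steps; that definition is my assumption.

```python
from statistics import mean


def quarterly_means(responses: dict[str, list[list[int]]]) -> dict[str, list[float]]:
    """responses maps quarter -> list of [q1, q2, q3, q4] answers (1-5 scale)."""
    return {quarter: [mean(col) for col in zip(*rows)]
            for quarter, rows in responses.items()}


def sustained_drop(series: list[float], steps: int = 2) -> bool:
    """True if the score fell in each of the last `steps` consecutive steps."""
    if len(series) < steps + 1:
        return False
    recent = series[-(steps + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

Feed `sustained_drop` the question 3 means in quarter order, for example `[4.1, 3.8, 3.4]`, and treat a `True` as a prompt to look at rework and review data.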

Hard rule: none of these KPIs belong in individual performance reviews. They are team-level signals. Aggregate at the team or org level, never at the engineer level. Older aggregate Copilot metrics enforced a 5-license floor for a reason, and the newer usage reports can expose richer user-level data to authorized admins. That makes the governance rule more important, not less. Individual-level scoring of AI usage breaks trust faster than any cost overrun. If you operate in the EU, loop in your works council before rolling out any per-developer tracking.

DORA, the GitHub way

ESSP sits on top of DORA, the four metrics every platform team knows:

Deployment frequency: how often you ship to production
Lead time for changes: commit to production
Change failure rate: % of deploys causing incidents/rollbacks
Failed deployment recovery time (historically MTTR): how fast you recover from failures

DORA added reliability as a fifth metric in its 2021 report. The classic four are still the operational core; treat reliability as a separate signal when you have the data to back it.

GitHub provides direct hooks for all four through GitHub Actions, Deployments, the Checks API, and Issues. Several community Actions and partner integrations (Sleuth, LinearB, Faros) already feed DORA dashboards from this data. If you are on GitHub Enterprise, you already own the raw events. The work is wiring them together, not collecting them.

One honest caveat: DORA was designed around human commits. Agent-initiated PRs and cloud agent runs blur the lead time definition. A PR opened by copilot[bot] does not cleanly map to “commit to production” when a human still reviews and merges it. No consensus exists yet on how to count this. Track DORA from human-initiated work as your baseline and flag agent-initiated PRs separately until the tooling catches up.

The Copilot dashboards you get out of the box

For the AI side of the scorecard, Copilot Business and Enterprise admins get three first-party surfaces:

1. Copilot metrics dashboard: built into organization and enterprise settings. Aggregates engagement and feature usage with breakdowns by editor, language, and model. Good for “who is actually using this, on what, and where”.

2. Copilot usage metrics REST API: the machine-readable export path for BI tools. This is the current report-based API, not the older /orgs/{org}/copilot/metrics endpoint that closed on April 2, 2026.

# Latest 28-day organization usage report
curl -L \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-GitHub-Api-Version: 2026-03-10" \
  https://api.github.com/orgs/ORG/copilot/metrics/reports/organization-28-day/latest

# Single-day organization usage report
curl -L \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-GitHub-Api-Version: 2026-03-10" \
  "https://api.github.com/orgs/ORG/copilot/metrics/reports/organization-1-day?day=2026-06-02"

A trimmed response looks like this:

{
  "download_links": [
    "https://example.com/copilot-usage-report-1.json",
    "https://example.com/copilot-usage-report-2.json"
  ],
  "report_start_day": "2026-05-06",
  "report_end_day": "2026-06-02"
}

A few things worth knowing before you wire this up:

The current usage metrics API returns signed download links to report files. Build your pipeline around fetching and storing those files, not around one inline JSON payload.
Reports are generated daily and include complete days only. Do not expect same-hour freshness.
The newer reports include richer organization, enterprise, user, and user-team views. Use the user-level views to aggregate to teams and find enablement gaps, not to rank developers.
The legacy Copilot metrics endpoints were closed down on April 2, 2026. If an internal dashboard still calls /copilot/metrics, fix that before building anything new on top of it.

3. Copilot activity report: a separate CSV-style report for license hygiene, not productivity. It tracks last_activity_at per user (90-day retention) so admins can reclaim seats from inactive users. Pair this with the metrics API: the API tells you what your engaged users are doing, the activity report tells you who has gone dark.

Suggested wiring

A workable starter stack for a Copilot Business or Enterprise customer:

Layer	Source	Refresh
Cost & credits	Usage report (`aic_quantity`, `aic_gross_amount`)	Daily
Engagement & feature mix	Copilot metrics dashboard + usage metrics API reports	Daily
License hygiene	Copilot activity report (`last_activity_at`)	Daily
DORA + delivery	GitHub Actions, Deployments, Checks, Issues	Per event
Spec adherence	Spec/task artifacts in repo	Per PR

Stream all of it into one BI tool (Looker Studio, Power BI, Metabase) and you have the scorecard from this post without buying a separate platform.

One caveat: the Copilot usage metrics API and dashboard only cover GitHub Copilot. If you run Claude Code, Cursor, or direct API access alongside it, you will still need to instrument those separately:

Anthropic / Claude Code: Anthropic Console → Usage and Cost reports, plus the Admin API for programmatic access
OpenAI / direct API: OpenAI platform → Usage dashboard and the /v1/usage endpoint
Cursor: Cursor Admin dashboard for team usage, with SAML/SCIM-gated exports on Business and Enterprise plans
Anything else: if the vendor cannot show you per-team token or request counts, that is itself a KPI signal

Failure modes to avoid

Common trap: teams celebrate speed improvements while defect leakage rises.

Always pair at least one speed KPI with one quality KPI in the same review. Never read them in isolation.

1) Vanity speed wins

Lead time drops. Defects rise. Team celebrates anyway.

Fix: pair every speed KPI with a quality KPI.

2) Cost panic, then productivity collapse

Finance sees a spike. Team gets hard limits overnight. Engineers stop using agents for useful work.

Fix: start with alerts at 75%, 90%, and 100%. Use hard stops deliberately. A hard stop is useful for runaway usage; it is painful when it blocks a developer halfway through legitimate work.
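The alert ladder is a few lines of code if your budget tooling does not provide it. A sketch with hypothetical function names, using the 75%, 90%, and 100% thresholds from above:

```python
THRESHOLDS = (0.75, 0.90, 1.00)


def crossed_thresholds(used: float, budget: float,
                       thresholds: tuple = THRESHOLDS) -> list[float]:
    """All thresholds that current usage has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = used / budget
    return [t for t in thresholds if ratio >= t]


def newly_crossed(previous_used: float, used: float, budget: float,
                  thresholds: tuple = THRESHOLDS) -> list[float]:
    """Thresholds crossed since the last check, so each alert fires once."""
    before = set(crossed_thresholds(previous_used, budget, thresholds))
    return [t for t in crossed_thresholds(used, budget, thresholds) if t not in before]
```

With a 19,000-credit pool, moving from 15,000 to 17,500 credits used triggers only the 90% alert, because 75% was already reported.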

3) One-model policy for all tasks

The same expensive model gets used for tiny edits and hard architecture work.

Fix: route by task class:

low-cost models for routine edits/tests/docs
mid-tier models for normal multi-file implementation
frontier models for ambiguous, high-risk, or architecture-heavy work

GitHub’s targeted model rules (released May 26, 2026) give enterprise admins a platform-level mechanism to enforce this tiering. Setting frontier models to Optional means organizations must explicitly opt in to enable them, rather than having every engineer reach for GPT-5.5 or Claude Opus 4.8 by default. You can also create targeted rules that allow specific models only for selected organizations, which is useful when one team legitimately needs frontier access and others do not.
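At team-convention level, routing by task class is a lookup table with a cheap default. The task class names below are hypothetical labels; map them to whatever your team actually calls its work types.

```python
# Hypothetical task classes mapped to the three tiers described above.
TIER_BY_TASK = {
    "routine_edit": "low",
    "tests": "low",
    "docs": "low",
    "multi_file_implementation": "mid",
    "architecture": "frontier",
    "ambiguous": "frontier",
}


def pick_tier(task_class: str, high_risk: bool = False) -> str:
    """Cheap by default, frontier by exception."""
    if high_risk:
        return "frontier"
    return TIER_BY_TASK.get(task_class, "low")
```

Unknown task classes fall back to the low tier on purpose: an engineer has to name the exception to get a frontier model, which is the behavior the policy is trying to encourage.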

4) Metric gaming (Goodhart’s Law)

People optimize what is visible and ignore what matters.

Fix: rotate one secondary KPI each quarter and audit behavior, not just numbers.

What good looks like after 90 days

You do not need every metric to improve.

A healthy pattern usually looks like this:

speed improves clearly
quality stays stable or improves slightly
reliability improves slowly (often lags)
AI credits rise, but slower than delivered value

If speed improves 20% and escaped defects double, that is not progress. If AI credit usage rises 30% while lead time drops 25% and rework drops 20%, that is often a good trade.

Minimum viable dashboard

If you only track five metrics this month, track these:

Lead time for change
Review cycle count
Escaped defects per sprint
First-run CI pass rate
AI credits per merged PR

That is enough to answer the one question leadership eventually asks:

Should we scale this, pause this, or change how we use it?

Without KPIs, that decision is politics. With KPIs, it is engineering.

Honest disclaimer: nobody has this fully figured out

Everything above is a working framework, not a settled one.

Usage-based billing for AI coding has barely started in most organisations. GitHub’s AI Credits model launches on June 1, 2026, just days after this post is dated. The first-generation Copilot metrics API has only been GA since October 2024, and the current report-based usage metrics API and aic_* fields are much newer than that. There is no five-year benchmark dataset to point at, no industry-standard scorecard, and no consultancy that has run this playbook through a full annual cycle. Anyone selling you certainty here is selling you something.

What is actually happening, across most engineering orgs right now: people are entering the first month under the new pricing, reading the billing docs for the first time, and trying to figure out which model choices and which workflows will actually be worth the spend. Token optimization is a live research problem, not a solved one. Best practices are being written in public, in real time, by people who are equally in the dark.

That is why the community tooling matters. A few examples worth knowing:

Xebia’s GitHub Copilot Premium Requests Usage Analyzer is a single-page app that Jesse Houwing pointed me to for the billing transition. It takes the CSV export from GitHub’s Billing → Usage page and visualizes it locally: usage over time, distribution by model, compliant vs. quota-exceeding requests. Data stays in your browser, code is open source on github.com/xebia/github-copilot-premium-reqs-usage. The fastest way to see where your premium requests are going without setting up a BI pipeline first.
Xebia’s Copilot Billing Preview is the practical next step for the June 2026 billing switch. Upload the Premium Request Usage report CSV from GitHub’s Billing and licensing → Preview your billing impact page, and it previews what that usage would look like under AI Credits. It is browser-only: the CSV is not uploaded, stored, cached, or sent over the network. Treat the numbers as indicative, not a bill, but it is a useful first check before building your own dashboard.
Rob Bos’s AI Engineering Fluency is a VS Code extension (also for JetBrains and CLI) that reads local session logs from GitHub Copilot, Claude Code, Gemini CLI, Cursor, and a dozen other tools. It shows token usage per model, per editor, per session, and per day, with estimated cost over time. The Fluency Score is the most interesting part: it maps your usage patterns against six dimensions (prompt engineering, context engineering, agentic usage, tool usage, customization, workflow integration) and places each on a four-stage scale from Skeptic to AI Strategist. Works for individual devs checking their own habits and for leads who want to see where the team is stuck at stage one. Free, MIT licensed, all data stays local. Install from the VS Code Marketplace or run npx @rajbos/ai-engineering-fluency stats from the terminal.
GitHub’s own Copilot metrics dashboard: the official in-product surface, available to Business and Enterprise admins.
Vendor admin dashboards for Anthropic, OpenAI, and Cursor (covered above).

None of these tools give you the full picture on their own. The scorecard in this post is my current best attempt to combine them into something a leadership team can act on. Expect it to change. Expect your own version to look different in six months. That is the work.

If you have a better dashboard, a sharper metric, or a war story about a KPI that backfired, I want to hear it. We are all writing this playbook together.

Closing thought

Most teams do not fail with AI coding agents because the models are bad. They fail because they cannot separate speed gains from quality debt and cost noise. By the time finance notices the bill, the conversation is already defensive.

A scorecard is not optional anymore. It is the only thing that lets you say “yes, expand this”, “no, pull this back”, or “change how we use it” without it being a fight.

Keep it boring. Keep it team-level. Tie every number to a decision someone will actually make.

Then you can argue about prompts and model mix all day, because the foundation is no longer the argument.

Hidde de Smet
