Agent benchmark suites for IT-tier tasks — ITSM-bench proposal (May 2026)

Source report: /tmp/agent-benchmarks-itsm-2026-05-11.md (409 lines).

TL;DR — the gap is clean

No public benchmark exercises password reset, access provisioning, equipment request, compliance evidence gathering, or employee-facing RCA as end-to-end ticket-lifecycle resolution.

  • WorkArena (ServiceNow) is form-filling on the ServiceNow UI only.
  • ITBench (IBM + UIUC, ICML 2025) is operator-side (SRE/CISO/FinOps), NOT employee-helpdesk.
  • AIOpsLab (Microsoft + IBM) is cloud-ops.
  • The employee-facing IT ticket lifecycle is uncovered.

22 existing benchmarks mapped

General agent benchmarks

  • AgentBench · WebArena · VisualWebArena · GAIA · Mind2Web · OSWorld · OSWorld-Human · ToolBench

IT-ops specific (operator-side)

  • ITBench (IBM + UIUC, ICML 2025): SoTA 13.8% on SRE, 25.2% on CISO, 0% on FinOps
  • AIOpsLab — Microsoft + IBM, coins “AgentOps”
  • WorkArena / WorkArena++ (ServiceNow) — form-filling on SN UI

Reliability + consistency

Tool-use

  • API-Bank · ToolEmu · MetaTool · BFCL

Multi-agent

  • MAST — Why Do Multi-Agent LLM Systems Fail?
  • MAgIC (Lin Xu et al., Nov 2023): game-theoretic multi-agent eval (spelled “MAgIC”, not “MAGIC” as the brief had it)

Contamination context

OpenAI publicly stopped reporting SWE-bench Verified results after its audit found that every frontier model shows training-data contamination. The best model scores 46% on SWE-Bench Pro versus 81% on SWE-bench Verified.

ITSM-bench specification

6 task families × 30-100 scenarios each

  1. Password reset (low complexity)
  2. Access provisioning (medium — requires IdP action)
  3. Software install request (medium)
  4. Equipment request (low — but multi-step approval)
  5. Compliance evidence gathering (high)
  6. Root-cause analysis (high)

At 6 families × 30-100 scenarios each, the suite totals 180-600 scenarios, a size comparable to ITBench and WorkArena.
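
The sizing rule above can be expressed as a small manifest check. This is a sketch only: the TaskFamily type, the validate_suite function, and the per-family scenario counts are hypothetical placeholders chosen to fall inside the 30-100 range; the proposal does not fix them. The family names and complexity labels come from the list above.

```python
from dataclasses import dataclass

# Bounds taken from the spec line "6 task families x 30-100 scenarios each".
EXPECTED_FAMILIES = 6
MIN_SCENARIOS, MAX_SCENARIOS = 30, 100


@dataclass(frozen=True)
class TaskFamily:
    name: str
    complexity: str      # "low", "medium", or "high", as in the list above
    scenario_count: int  # hypothetical placeholder, not fixed by the proposal


def validate_suite(families: list[TaskFamily]) -> int:
    """Check the suite shape and return the total number of scenarios."""
    if len(families) != EXPECTED_FAMILIES:
        raise ValueError(f"expected {EXPECTED_FAMILIES} families, got {len(families)}")
    for family in families:
        if not MIN_SCENARIOS <= family.scenario_count <= MAX_SCENARIOS:
            raise ValueError(f"{family.name}: {family.scenario_count} scenarios is out of range")
    return sum(family.scenario_count for family in families)


# Illustrative counts only (assumption).
SUITE = [
    TaskFamily("password-reset", "low", 100),
    TaskFamily("access-provisioning", "medium", 60),
    TaskFamily("software-install-request", "medium", 60),
    TaskFamily("equipment-request", "low", 50),
    TaskFamily("compliance-evidence-gathering", "high", 30),
    TaskFamily("root-cause-analysis", "high", 30),
]

total = validate_suite(SUITE)
print(total)  # 330 for the illustrative counts above
```

Keeping the check separate from the scenario content means a family can grow or shrink without touching the scenarios themselves, as long as it stays inside the stated bounds.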

5 metrics (3 novel)

  1. Single-pass accuracy (standard)
  2. pass^k consistency for ITSM (novel; adapts τ-bench's pass^k, the probability that all k independent trials of a task succeed)
  3. Time-to-resolution (standard)
  4. Policy-conformance rate (novel — does the agent respect the policy?)
  5. Audit-trail completeness (novel — can the agent’s actions be reconstructed?)
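
The pass^k metric in item 2 can be sketched as follows. This is a minimal sketch, assuming the estimator used by τ-bench: for a task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names and scenario IDs are hypothetical and not part of the ITSM-bench proposal.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials of one task all succeed."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(successes, k) is 0 when successes < k, so such tasks score 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks in the suite."""
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: 8 trials per scenario, True means the ticket was resolved.
results = {
    "password-reset-001": [True] * 8,
    "access-provisioning-014": [True] * 6 + [False] * 2,
    "root-cause-analysis-003": [True] * 2 + [False] * 6,
}

print(round(benchmark_pass_hat_k(results, k=1), 3))  # 0.667
print(round(benchmark_pass_hat_k(results, k=4), 3))  # 0.405
```

With k = 1 the estimate reduces to single-pass accuracy (metric 1). In this example the suite drops from about 0.67 at k = 1 to about 0.40 at k = 4, which is the consistency gap that pass^k is meant to expose.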

Design differentiators

  • Customer-data-isolated — no leakage, addresses the SWE-bench contamination problem at design time.
  • Live system integration — not just LLM queries; agents execute against test instances.
  • Multi-team policy resolution — agent must reconcile conflicting team policies.
  • SLA-grade reporting — pass^k as the headline metric, not single-pass.

Publishing playbook

  • Target venue: NeurIPS Datasets & Benchmarks track (direct precedent: WorkArena, OSWorld, Mind2Web).
  • Consortium-partner sequencing: Okta · Jamf · Atlassian · Vanta.
  • Explicit exclusion: ServiceNow — to avoid the WorkArena conflict frame.
  • Artifacts: GitHub release · public leaderboard · paper · industry-consortium framing.

Notes

  • UK AISI’s Inspect-AI eval harness has the highest development velocity in the eval space (see oss-agent-infra-2026).