Agent benchmark suites for IT-tier tasks — ITSM-bench proposal (May 2026)
Source report: /tmp/agent-benchmarks-itsm-2026-05-11.md (409 lines).
TL;DR — the gap is clean
No public benchmark evaluates password reset, access provisioning, equipment request, compliance evidence gathering, or employee-facing root-cause analysis (RCA) as end-to-end ticket-lifecycle resolution.
- WorkArena (ServiceNow) is form-filling on the ServiceNow UI only.
- ITBench (IBM + UIUC, ICML 2025) is operator-side (SRE/CISO/FinOps), NOT employee-helpdesk.
- AIOpsLab (Microsoft Research) covers cloud operations.
- The employee-facing IT ticket lifecycle is uncovered.
22 existing benchmarks mapped
General agent benchmarks
AgentBench · WebArena · VisualWebArena · GAIA · Mind2Web · OSWorld · OSWorld-Human · ToolBench
IT-ops specific (operator-side)
- ITBench — IBM + UIUC, ICML 2025 — SoTA 13.8% / 25.2% / 0% on SRE / CISO / FinOps
- AIOpsLab — Microsoft Research, coins “AgentOps”
- WorkArena / WorkArena++ (ServiceNow) — form-filling on SN UI
Reliability + consistency
- τ-bench / τ²-Bench — Sierra; pass^k consistency
- CLEAR — “Beyond Accuracy” framework; success drops from 60% on a single run to 25% when consistency is required across 8 runs
- METR Time Horizons — 50% horizon doubles every ~7 months
Code-related
- SWE-bench / SWE-bench Verified / SWE-Bench Illusion — 6-32% contamination
Tool-use
- API-Bank · ToolEmu · MetaTool · BFCL
Multi-agent
- MAST — Why Do Multi-Agent LLM Systems Fail?
- MAgIC (Lin Xu et al., Nov 2023) — game-theoretic multi-agent eval (spelled “MAgIC”, not “MAGIC” as in the brief)
Contamination context
OpenAI publicly stopped reporting SWE-bench Verified after its audit found training-data contamination in every frontier model. The best model scores 46% on SWE-Bench Pro versus 81% on SWE-bench Verified.
ITSM-bench specification
6 task families × 30-100 scenarios each
- Password reset (low complexity)
- Access provisioning (medium — requires IdP action)
- Software install request (medium)
- Equipment request (low — but multi-step approval)
- Compliance evidence gathering (high)
- Root-cause analysis (high)
Sizing comparable to ITBench and WorkArena.
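The family list and sizing above can be sketched as a small data model. This is only an illustration: the class names, identifiers, and the `suite_size_bounds` helper are hypothetical and not part of the proposal. Only the six families, their complexity tiers, and the 30-100 scenarios-per-family range come from the spec.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class TaskFamily:
    """One ITSM-bench task family (hypothetical representation)."""
    name: str
    complexity: Complexity
    note: str = ""


# The six families and tiers listed in the specification.
TASK_FAMILIES = (
    TaskFamily("password_reset", Complexity.LOW),
    TaskFamily("access_provisioning", Complexity.MEDIUM, "requires IdP action"),
    TaskFamily("software_install_request", Complexity.MEDIUM),
    TaskFamily("equipment_request", Complexity.LOW, "multi-step approval"),
    TaskFamily("compliance_evidence_gathering", Complexity.HIGH),
    TaskFamily("root_cause_analysis", Complexity.HIGH),
)

# Scenario count range per family, as stated in the spec.
MIN_SCENARIOS_PER_FAMILY = 30
MAX_SCENARIOS_PER_FAMILY = 100


def suite_size_bounds(families=TASK_FAMILIES):
    """Return (min_total, max_total) scenarios for the whole suite."""
    n = len(families)
    return n * MIN_SCENARIOS_PER_FAMILY, n * MAX_SCENARIOS_PER_FAMILY


if __name__ == "__main__":
    print(suite_size_bounds())  # (180, 600)
```

Under these numbers the full suite would hold between 180 and 600 scenarios.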
5 metrics (3 novel)
- Single-pass accuracy (standard)
- pass^k consistency for ITSM (novel — adapts τ-bench)
- Time-to-resolution (standard)
- Policy-conformance rate (novel — does the agent respect the policy?)
- Audit-trail completeness (novel — can the agent’s actions be reconstructed?)
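As a rough illustration of the pass^k metric, the sketch below uses the estimator from τ-bench: for a task with c successes in n independent trials, pass^k is estimated as C(c, k) / C(n, k), then averaged over tasks. How ITSM-bench would adapt this is not specified here, so the function names and scenario IDs are assumptions made for the example.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent trials of one task all succeed.

    Uses C(successes, k) / C(trials, k), the tau-bench pass^k estimator.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when k > successes, so the estimate is 0 in that case.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task pass^k estimate over all tasks in `results`."""
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical scenario IDs, 8 runs each.
    runs = {
        "password_reset_001": [True] * 8,
        "access_provisioning_001": [True] * 4 + [False] * 4,
    }
    print(benchmark_pass_hat_k(runs, k=1))  # 0.75 (equals mean single-pass accuracy)
    print(benchmark_pass_hat_k(runs, k=8))  # 0.5 (only the fully consistent task counts)
```

With k = 1 the metric reduces to single-pass accuracy. As k grows, tasks that succeed only some of the time contribute less and less, which is why it suits SLA-style reporting.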
Design differentiators
- Customer-data-isolated — no leakage, addresses the SWE-bench contamination problem at design time.
- Live system integration — not just LLM queries; agents execute against test instances.
- Multi-team policy resolution — agent must reconcile conflicting team policies.
- SLA-grade reporting — pass^k as the headline metric, not single-pass.
Publishing playbook
- Target venue: NeurIPS Datasets & Benchmarks track (direct precedent: WorkArena++, OSWorld, Mind2Web).
- Consortium-partner sequencing: Okta · Jamf · Atlassian · Vanta.
- Explicit exclusion: ServiceNow — to avoid the WorkArena conflict frame.
- Artifacts: GitHub release · public leaderboard · paper · industry-consortium framing.
Notes
- UK AISI’s Inspect-AI has the highest development velocity of any eval harness (see oss-agent-infra-2026).
Related
- academic-papers-agent-reliability-2026 — paper provenance for ITSM-bench design
- oss-agent-infra-2026 — Inspect-AI harness pairing
- agent-tool-governance
- servicenow · okta · atlassian · vanta
- Init Intelligence