Agent benchmark suites for IT-tier tasks — ITSM-bench proposal (May 2026)

Source report: /tmp/agent-benchmarks-itsm-2026-05-11.md (409 lines).

TL;DR — the gap is clean

No public benchmark exercises password reset / access provisioning / equipment request / compliance evidence gathering / employee-facing RCA as end-to-end ticket-lifecycle resolution.

WorkArena (ServiceNow) is form-filling on the ServiceNow UI only.
ITBench (IBM + UIUC, ICML 2025) is operator-side (SRE/CISO/FinOps), NOT employee-helpdesk.
AIOpsLab (Microsoft + IBM) is cloud-ops.
The employee-facing IT ticket lifecycle is uncovered.

22 existing benchmarks mapped

General agent benchmarks

AgentBench · WebArena · VisualWebArena · GAIA · Mind2Web · OSWorld · OSWorld-Human · ToolBench

IT-ops specific (operator-side)

ITBench: IBM + UIUC, ICML 2025; SoTA 13.8% / 25.2% / 0% on SRE / CISO / FinOps
AIOpsLab: Microsoft + IBM; coins “AgentOps”
WorkArena / WorkArena++ (ServiceNow): form-filling on the ServiceNow UI

Reliability + consistency

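τ-bench / τ²-Bench: Sierra; pass^k consistency (all k repeated trials of a task must succeed)
CLEAR: Beyond Accuracy framework; success drops from 60% on a single run to 25% when consistency over 8 runs is required
METR Time Horizons: the task length that models complete with 50% success doubles roughly every 7 months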
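
The doubling claim above is plain exponential growth, and a small calculation makes it concrete. The sketch below assumes a constant 7-month doubling period; the 60-minute starting horizon and both function names are illustrative assumptions, not figures from the source report.

```python
import math

DOUBLING_MONTHS = 7.0  # "doubles every ~7 months", treated here as exact


def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float = DOUBLING_MONTHS) -> float:
    """Extrapolate the 50%-success task horizon under constant doubling."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the horizon to grow from current to target."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Hypothetical starting point: a 60-minute horizon today.
print(projected_horizon(60, 14))  # two doublings -> 240.0 minutes
print(months_to_reach(60, 480))   # three doublings -> 21.0 months
```

Such an extrapolation only holds while the trend stays exponential, so it is better read as a planning heuristic than a forecast.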

Code-related

SWE-bench / SWE-bench Verified / SWE-Bench Illusion: 6-32% contamination

Tool-use

API-Bank · ToolEmu · MetaTool · BFCL

Multi-agent

MAST: Why Do Multi-Agent LLM Systems Fail?
MAgIC (Lin Xu et al., Nov 2023): game-theoretic multi-agent evaluation (the name is spelled “MAgIC”; the brief's “MAGIC” was a misspelling)

Contamination context

OpenAI publicly stopped reporting SWE-bench Verified after its audit found that every frontier model shows training-data contamination. The best model scores 46% on SWE-Bench Pro versus 81% on SWE-bench Verified.

ITSM-bench specification

6 task families × 30-100 scenarios each

Password reset (low complexity)
Access provisioning (medium — requires IdP action)
Software install request (medium)
Equipment request (low — but multi-step approval)
Compliance evidence gathering (high)
Root-cause analysis (high)

Sizing (180-600 scenarios in total) is comparable to ITBench and WorkArena.
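
The task-family list above can be written down as a small catalogue, which also makes the total size explicit. This is a minimal sketch: the class names, field names, and snake_case identifiers are assumptions for illustration, and only the six families, their complexity labels, and the 30-100 range come from the specification.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class TaskFamily:
    name: str               # hypothetical identifier
    complexity: Complexity
    note: str = ""          # qualifier taken from the specification list


TASK_FAMILIES = (
    TaskFamily("password_reset", Complexity.LOW),
    TaskFamily("access_provisioning", Complexity.MEDIUM, "requires IdP action"),
    TaskFamily("software_install_request", Complexity.MEDIUM),
    TaskFamily("equipment_request", Complexity.LOW, "multi-step approval"),
    TaskFamily("compliance_evidence_gathering", Complexity.HIGH),
    TaskFamily("root_cause_analysis", Complexity.HIGH),
)

MIN_SCENARIOS_PER_FAMILY = 30
MAX_SCENARIOS_PER_FAMILY = 100


def suite_size_bounds(families=TASK_FAMILIES) -> tuple[int, int]:
    """Smallest and largest total scenario count for the suite."""
    n = len(families)
    return n * MIN_SCENARIOS_PER_FAMILY, n * MAX_SCENARIOS_PER_FAMILY


print(suite_size_bounds())  # (180, 600)
```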

5 metrics (3 novel)

Single-pass accuracy (standard)
pass^k consistency for ITSM (novel — adapts τ-bench)
Time-to-resolution (standard)
Policy-conformance rate (novel — does the agent respect the policy?)
Audit-trail completeness (novel — can the agent’s actions be reconstructed?)
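
A rough sketch of how four of these metrics could be computed from repeated trials of one scenario follows; time-to-resolution is omitted because it is an ordinary duration measurement. The `TrialResult` fields and all function names are hypothetical. The pass^k estimator is the combinatorial form C(c, k) / C(n, k) used for τ-bench-style consistency, where c of n trials succeeded. Treating policy conformance as "zero violations in a trial" and audit completeness as "logged actions over executed actions" is an assumption, since the proposal does not fix these definitions.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class TrialResult:
    resolved: bool          # ticket reached a correct resolution
    policy_violations: int  # hypothetical count of failed policy checks
    logged_actions: int     # actions recoverable from the audit trail
    executed_actions: int   # actions actually run against the test instance


def pass_hat_k(outcomes: list[bool], k: int) -> float:
    """Estimate the chance that k trials drawn from these outcomes all succeed."""
    n, c = len(outcomes), sum(outcomes)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k


def single_pass_accuracy(trials: list[TrialResult]) -> float:
    return sum(t.resolved for t in trials) / len(trials)


def policy_conformance_rate(trials: list[TrialResult]) -> float:
    """Fraction of trials with no policy violation (assumed definition)."""
    return sum(t.policy_violations == 0 for t in trials) / len(trials)


def audit_trail_completeness(trials: list[TrialResult]) -> float:
    """Share of executed actions that appear in the audit trail (assumed definition)."""
    executed = sum(t.executed_actions for t in trials)
    logged = sum(t.logged_actions for t in trials)
    return 1.0 if executed == 0 else logged / executed


trials = [
    TrialResult(True, 0, 5, 5),
    TrialResult(True, 0, 4, 5),
    TrialResult(False, 1, 3, 3),
    TrialResult(True, 0, 6, 6),
]
print(single_pass_accuracy(trials))                    # 0.75
print(pass_hat_k([t.resolved for t in trials], k=2))   # 0.5
print(policy_conformance_rate(trials))                 # 0.75
print(audit_trail_completeness(trials))                # 18/19, about 0.947
```

The gap between 0.75 single-pass accuracy and 0.5 at k=2 in this toy data illustrates why a consistency metric can tell a different story from a single-run score.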

Design differentiators

Customer-data-isolated — no leakage, addresses the SWE-bench contamination problem at design time.
Live system integration — not just LLM queries; agents execute against test instances.
Multi-team policy resolution — agent must reconcile conflicting team policies.
SLA-grade reporting — pass^k as the headline metric, not single-pass.
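
To illustrate the multi-team policy resolution differentiator, the sketch below grades an agent's decision against a reference resolution of conflicting team policies. The "most restrictive decision wins" rule, the three decision levels, and the team names are all assumptions made for this example; the proposal does not specify how conflicts should be reconciled.

```python
from enum import IntEnum


class Decision(IntEnum):
    # Ordered from least to most restrictive (assumed ordering).
    ALLOW = 0
    REQUIRE_APPROVAL = 1
    DENY = 2


def reconcile(team_decisions: dict[str, Decision]) -> Decision:
    """Reference resolution: the most restrictive team decision wins."""
    if not team_decisions:
        raise ValueError("at least one team policy is required")
    return max(team_decisions.values())


def grade(agent_decision: Decision, team_decisions: dict[str, Decision]) -> bool:
    """True when the agent's decision matches the reference resolution."""
    return agent_decision == reconcile(team_decisions)


# Hypothetical scenario: IT ops would grant access, security wants approval.
scenario = {"it_ops": Decision.ALLOW, "security": Decision.REQUIRE_APPROVAL}
print(reconcile(scenario).name)         # REQUIRE_APPROVAL
print(grade(Decision.ALLOW, scenario))  # False
```

A real scenario set would likely need richer rules (precedence by team, exceptions, expiry), so this only shows the shape of a conformance check.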

Publishing playbook

Target venue: NeurIPS Datasets & Benchmarks track (direct precedent: WorkArena, OSWorld, Mind2Web).
Consortium-partner sequencing: Okta · Jamf · Atlassian · Vanta.
Explicit exclusion: ServiceNow, to avoid being framed as in conflict with ServiceNow's own WorkArena benchmark.
Artifacts: GitHub release · public leaderboard · paper · industry-consortium framing.

Notes

Inspect-AI, UK AISI’s eval harness, has the highest development velocity in the eval space (see oss-agent-infra-2026).

Related

academic-papers-agent-reliability-2026: paper provenance for ITSM-bench design
oss-agent-infra-2026: Inspect-AI harness pairing
agent-tool-governance
servicenow · okta · atlassian · vanta