Academic papers on AI agent reliability + ITSM relevance (May 2026)

This page summarizes the May 11, 2026 academic-research pass. Source report: /tmp/academic-papers-agent-reliability-2026-05-11.md (407 lines, 44 KB). It contains 23 paper cards, each with Class A arXiv verification.

Top-10 papers for Init Intelligence

1. ITBench (IBM + UIUC, ICML 2025) — direct Init Intelligence proof point

ITBench paper.

  • Exact match for SRE / CISO / FinOps wedge.
  • SoTA models resolve only 13.8% / 25.2% / 0% of SRE / CISO / FinOps tasks, respectively. (The initial search returned 11.4% / 25.2% / 25.8%, but the HTML render of the paper confirms 13.8% / 25.2% / 0%; this wiki uses the verified numbers.)
  • The 0% on FinOps is the most-cited number in the deep-research pass: it indicates there is no model-only solution for the Init Intelligence wedge today.

2. AIOpsLab (Microsoft + IBM, MLSys 2025)

AIOpsLab paper.

3. τ-bench / τ²-Bench (Sierra, NeurIPS 2024 / June 2025)

τ-bench / τ²-Bench.

  • Introduces pass^k consistency metric (consistency across k runs of the same task).
  • pass^8 < 25% in retail.
  • GPT-4 pass@1 drops from 56-74% to 34% in dual-control telecom.
  • Because pass^k requires all k runs of a task to succeed, it can never exceed single-pass accuracy (pass^1); the gap between the two is the consistency loss.
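
The pass^k metric above can be made concrete with a small sketch. It assumes the combinatorial estimator commonly attributed to the τ-bench paper: for each task with c successes out of n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and the per-task success counts below are hypothetical illustration data, not results from any paper.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: mean over tasks of C(c, k) / C(n, k).

    successes_per_task: number of successful trials for each task
    n_trials: trials run per task (same for every task in this sketch)
    k: number of runs that must ALL succeed
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = comb(n_trials, k)
    # math.comb(c, k) is 0 when c < k, so tasks with too few successes add 0.
    return sum(comb(c, k) / total for c in successes_per_task) / len(successes_per_task)


# Hypothetical data: 4 tasks, 8 trials each.
successes = [8, 6, 4, 0]
pass_1 = pass_hat_k(successes, n_trials=8, k=1)  # plain average success rate
pass_8 = pass_hat_k(successes, n_trials=8, k=8)  # only tasks that never failed count

print(f"pass^1 = {pass_1:.4f}, pass^8 = {pass_8:.4f}")
```

With this made-up data, only the task that succeeded in all 8 trials contributes to pass^8, which is why the k-run figure falls so far below the single-pass figure.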

4. METR Time Horizons (NeurIPS 2025)

METR Time Horizons paper.

  • 50%-task-completion horizon doubles every ~7 months.
  • Even at this compounding rate, today’s models still cannot reliably complete 4-hour autonomous tasks.
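
The doubling arithmetic above can be sketched as follows. The only number taken from this page is the ~7-month doubling period; the 60-minute starting horizon is a hypothetical placeholder, not a figure from the METR paper. The horizon here is a 50%-completion horizon, so reaching 4 hours under this model still would not mean 4-hour tasks are done reliably.

```python
from math import log2

DOUBLING_MONTHS = 7.0  # approximate doubling period quoted above


def horizon_after(months, current_minutes, doubling_months=DOUBLING_MONTHS):
    """50%-completion horizon after `months`, assuming steady exponential growth."""
    return current_minutes * 2 ** (months / doubling_months)


def months_until(target_minutes, current_minutes, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from current to target."""
    if current_minutes <= 0 or target_minutes <= 0:
        raise ValueError("horizons must be positive")
    if target_minutes <= current_minutes:
        return 0.0
    return doubling_months * log2(target_minutes / current_minutes)


# Hypothetical starting point: a 60-minute horizon today, 240-minute target.
wait = months_until(240, 60)  # two doublings under the assumed trend
print(f"months to reach a 4-hour 50% horizon: {wait:.1f}")
```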

5. Beyond Accuracy / CLEAR (Nov 2025)

Beyond Accuracy CLEAR paper.

  • Success rate drops from 60% on a single run to 25% under 8-run consistency on enterprise tasks.
  • CLEAR framework directly applicable to managed-service SLAs.

6-10. Honorable mentions

Findings consensus across 23 papers

  1. Single-shot scores systematically overstate production reliability. Every paper with multi-run consistency analysis (τ-bench, CLEAR, OSWorld-Human) shows substantial drops when measuring k-run consistency vs single-pass.
  2. Tool-use is the dominant failure surface (not reasoning, not knowledge).
  3. SWE-bench family is contaminated (6-32% inflation depending on metric).
  4. Long-horizon execution is the bottleneck, but small gains in single-step accuracy compound into exponentially longer achievable task horizons as models scale.
  5. Enterprise deployment needs governance separate from the model — see agent-tool-governance and oss-agent-infra-2026.
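
The compounding in finding 4 can be illustrated with a deliberately simplified model. It assumes every step succeeds independently with the same probability and that there is no error recovery, which real agents do not satisfy exactly; the step accuracies below are hypothetical, not measurements from any of the papers.

```python
from math import log


def task_success(step_accuracy, n_steps):
    """Probability that all n_steps succeed, assuming independent steps."""
    return step_accuracy ** n_steps


def horizon_steps(step_accuracy, target=0.5):
    """Number of steps at which end-to-end success falls to `target`."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return log(target) / log(step_accuracy)


# Hypothetical step accuracies: one percentage point apart at the high end.
h99 = horizon_steps(0.99)
h999 = horizon_steps(0.999)
print(f"50% horizon at 99.0% per step: {h99:.0f} steps")
print(f"50% horizon at 99.9% per step: {h999:.0f} steps")
```

Under these assumptions, a gain of less than one percentage point in per-step accuracy lengthens the 50% horizon by roughly an order of magnitude, which is the sense in which small single-step gains compound.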

Open research gaps

  1. Dual-control IT-helpdesk benchmark. ITBench is close but not exactly the AI ITSM customer task.
  2. CLEAR-conformant managed-service SLAs. A formal framework for SLA-grade k-run consistency in production.
  3. Customer-specific contamination-controlled SWE benchmarks. Verify per-customer that the model has not seen analog data.
  4. Trajectory-hygiene techniques for runbook executors. How to clean noisy tool traces before learning from them.
  5. Production-grade telemetry sanitization. Removing PII from agent traces while preserving learnability.
  6. MAST-style failure taxonomy specific to IT-ops. Adapting the multi-agent failure modes to the IT runbook domain.
  7. In-context-scheming evals for narrowly-scoped agents. Apollo scheming generalizes; whether narrow IT agents inherit the same behaviors is unknown.

Verification discipline

  • Every cited card has Class A verification (arXiv abstract URL fetched in this session).
  • Class B/C evidence is strong for all top-10 papers (institutional blogs, OpenReview, venue pages, cross-citations).
  • §9 (Human-AI handoff) in source report is explicitly disclosed as the thinnest section.
  • 10 papers that surfaced in searches but were not separately fetched are listed in source report’s “cards not promoted” appendix.