Academic papers on AI agent reliability + ITSM relevance (May 2026)
This page summarizes the May 11, 2026 academic-research pass. Source report: /tmp/academic-papers-agent-reliability-2026-05-11.md (407 lines, 44KB). It contains 23 paper cards, each with Class A arXiv verification.
Top-10 papers for Init Intelligence
1. ITBench (IBM + UIUC, ICML 2025) — direct Init Intelligence proof point
- Exact match for SRE / CISO / FinOps wedge.
- SoTA models resolve only 13.8% / 25.2% / 0% of SRE / CISO / FinOps tasks, respectively. An initial search returned 11.4% / 25.2% / 25.8%, but the HTML render confirms 13.8% / 25.2% / 0%; this wiki uses the verified numbers.
- The 0% on FinOps is the most-cited number in the deep-research pass: it indicates there is no model-only solution for the Init Intelligence wedge today.
2. AIOpsLab (Microsoft + IBM, MLSys 2025)
- Coins “AgentOps” — verbatim Init Intelligence framing.
- Joint Microsoft and IBM authorship means that both Microsoft and ServiceNow-adjacent IBM are publishing on this primitive.
3. τ-bench / τ²-Bench (Sierra, NeurIPS 2024 / June 2025)
- Introduces pass^k consistency metric (consistency across k runs of the same task).
- pass^8 < 25% in retail.
- GPT-4 pass@1 drops from the 56-74% range to 34% in dual-control telecom.
- Because pass^k requires all k runs of a task to succeed, it is never higher than single-pass accuracy, and the gap widens as k grows.
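The pass^k metric above can be sketched as follows. This is a minimal illustration, assuming the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where c is the number of successful runs out of n for a task. The function name `pass_hat_k` and the run counts are hypothetical and are not taken from the source report.

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k: the chance that k runs of the same task ALL succeed.

    successes_per_task: success counts c_i (out of n runs) for each task.
    Uses the per-task estimator C(c_i, k) / C(n, k), averaged over tasks.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # math.comb returns 0 when c < k, so tasks with fewer than k
    # successes contribute nothing, as intended.
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: 4 tasks, 8 runs each (illustrative only).
counts = [8, 6, 4, 0]
pass_1 = pass_hat_k(counts, n=8, k=1)  # mean single-run success rate
pass_4 = pass_hat_k(counts, n=8, k=4)
pass_8 = pass_hat_k(counts, n=8, k=8)  # share of tasks solved on all 8 runs
print(pass_1, pass_4, pass_8)
```

With these made-up counts, the single-run rate is 0.5625 while pass^8 is 0.25, which shows how a respectable pass@1 can coexist with low k-run consistency.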
4. METR Time Horizons (NeurIPS 2025)
- 50%-task-completion horizon doubles every ~7 months.
- Even at this compounding rate, today's models still cannot complete 4-hour autonomous tasks reliably.
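The doubling claim above can be turned into a rough extrapolation. This is a sketch only: it assumes clean exponential growth with a 7-month doubling time, and the 1-hour starting horizon is a hypothetical placeholder, not a figure from the source report. The function names are illustrative.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time quoted above


def horizon_after(months, current_horizon_hours, doubling_months=DOUBLING_MONTHS):
    """Projected 50%-completion horizon (hours) after `months` of growth."""
    return current_horizon_hours * 2 ** (months / doubling_months)


def months_until(target_hours, current_horizon_hours, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from current to target."""
    if target_hours <= 0 or current_horizon_hours <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_hours / current_horizon_hours)


# Hypothetical starting point: a 1-hour horizon today (assumption).
print(horizon_after(7, 1.0))   # one doubling period later
print(months_until(4.0, 1.0))  # two doublings to reach 4 hours
```

Note that the horizon here is defined at 50% task completion, so reaching a 4-hour horizon under this model would still fall short of reliable 4-hour execution.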
5. Beyond Accuracy / CLEAR (Nov 2025)
- Success on enterprise tasks drops from 60% (single run) to 25% when measured as 8-run consistency.
- CLEAR framework directly applicable to managed-service SLAs.
6-10. Honorable mentions
- SWE-Bench Illusion — quantifies 6-32% inflation of SWE-bench scores due to contamination. Tracking SWE-bench Verified is not sufficient for capability claims.
- Why Do Multi-Agent LLM Systems Fail? — MAST failure taxonomy.
- OSWorld / OSWorld-Human — desktop-agent benchmark with human-validated trajectories.
- Apollo Scheming — in-context-scheming evals.
- MCP Landscape / MCP Safety Audit — formal analyses of the MCP protocol that’s now category-default. See oss-agent-infra-2026.
Findings consensus across 23 papers
- Single-shot scores systematically overstate production reliability. Every paper with multi-run consistency analysis (τ-bench, CLEAR, OSWorld-Human) shows substantial drops when measuring k-run consistency vs single-pass.
- Tool-use is the dominant failure surface (not reasoning, not knowledge).
- SWE-bench family is contaminated (6-32% inflation depending on metric).
- Long-horizon execution is the bottleneck, but gains in single-step accuracy compound exponentially over long tasks as models scale.
- Enterprise deployment needs governance separate from the model — see agent-tool-governance and oss-agent-infra-2026.
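The compounding point in the consensus list can be illustrated with a simple model. This sketch assumes every step of a task succeeds independently with the same probability, which is a simplification rather than a claim from any of the papers. The function names are hypothetical.

```python
import math


def task_success(step_accuracy, steps):
    """Probability of finishing all steps, assuming independent steps."""
    return step_accuracy ** steps


def max_steps(step_accuracy, target=0.5):
    """Longest task (in steps) that still succeeds with probability >= target."""
    if not 0 < step_accuracy < 1 or not 0 < target < 1:
        raise ValueError("step_accuracy and target must be in (0, 1)")
    return math.floor(math.log(target) / math.log(step_accuracy))


# Small per-step gains translate into much longer feasible tasks.
for p in (0.9, 0.99, 0.999):
    print(p, max_steps(p), round(task_success(p, 100), 3))
```

Under this model, moving per-step accuracy from 0.9 to 0.99 extends the 50%-success task length from 6 steps to 68, which is why small single-step gains matter so much for long-horizon work.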
Open research gaps
- Dual-control IT-helpdesk benchmark. ITBench is close but not exactly the AI ITSM customer task.
- CLEAR-conformant managed-service SLAs. A formal framework for SLA-grade k-run consistency in production.
- Customer-specific contamination-controlled SWE benchmarks. Verify per-customer that the model has not seen analog data.
- Trajectory-hygiene techniques for runbook executors. How to clean noisy tool traces before learning from them.
- Production-grade telemetry sanitization. Removing PII from agent traces while preserving learnability.
- MAST-style failure taxonomy specific to IT-ops. Adapting the multi-agent failure modes to the IT runbook domain.
- In-context-scheming evals for narrowly-scoped agents. Apollo scheming generalizes; whether narrow IT agents inherit the same behaviors is unknown.
Verification discipline
- Every cited card has Class A verification (arXiv abstract URL fetched in this session).
- Class B/C evidence is strong for all top-10 papers (institutional blogs, OpenReview, venue pages, cross-citations).
- §9 (Human-AI handoff) in source report is explicitly disclosed as the thinnest section.
- 10 papers that surfaced in searches but were not separately fetched are listed in source report’s “cards not promoted” appendix.
Related
- oss-agent-infra-2026 — engineering stack + ITSM-bench proposal
- agent-tool-governance
- microsoft · openai · anthropic
- Init Intelligence