Academic papers on AI agent reliability + ITSM relevance (May 2026)

This page summarizes the May 11, 2026 academic-research pass. Source report: /tmp/academic-papers-agent-reliability-2026-05-11.md (407 lines, 44KB). The report contains 23 paper cards, each with Class A arXiv verification.

Top-10 papers for Init Intelligence

1. ITBench (IBM + UIUC, ICML 2025) — direct Init Intelligence proof point

ITBench paper.

Exact match for SRE / CISO / FinOps wedge.
State-of-the-art models resolve only 13.8% of SRE, 25.2% of CISO, and 0% of FinOps tasks. (An initial search returned 11.4% / 25.2% / 25.8%, but the HTML render confirms 13.8% / 25.2% / 0%; this wiki uses the verified numbers.)
The 0% on FinOps is the most-cited number in the deep-research pass: it indicates there is no model-only solution for the Init Intelligence wedge today.

2. AIOpsLab (Microsoft + IBM, MLSys 2025)

AIOpsLab paper.

Coins “AgentOps” — verbatim Init Intelligence framing.
Joint Microsoft and IBM authorship means that both Microsoft and ServiceNow-adjacent IBM are publishing on this primitive.

3. τ-bench / τ²-Bench (Sierra, NeurIPS 2024 / June 2025)

τ-bench / τ²-Bench.

Introduces the pass^k consistency metric: the probability that an agent succeeds on all k independent runs of the same task.
pass^8 < 25% in retail.
GPT-4 pass@1 drops from 56-74% to 34% in dual-control telecom.
Because pass^k requires every one of the k runs to succeed, it can never exceed single-pass accuracy and does not increase as k grows.
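
The gap between single-pass accuracy and k-run consistency can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own harness: it assumes the usual combinatorial estimators over n recorded trials per task with c successes (C(c, k) / C(n, k) for pass^k, and 1 - C(n - c, k) / C(n, k) for pass@k). The function names and the per-task results are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(all k runs succeed) for one task: C(c, k) / C(n, k)."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # math.comb returns 0 when k > successes, so the estimate is 0 in that case.
    return comb(successes, k) / comb(trials, k)


def pass_at_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(at least one of k runs succeeds): 1 - C(n - c, k) / C(n, k)."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return 1.0 - comb(trials - successes, k) / comb(trials, k)


def mean_over_tasks(metric, results, k: int) -> float:
    """Average a per-task metric over (successes, trials) pairs."""
    return sum(metric(c, n, k) for c, n in results) / len(results)


# Hypothetical data: (successes, trials) for four tasks, 8 trials each.
results = [(8, 8), (6, 8), (4, 8), (0, 8)]

single_pass = mean_over_tasks(pass_hat_k, results, 1)   # plain accuracy
consistency = mean_over_tasks(pass_hat_k, results, 8)   # all 8 runs succeed
best_of_8 = mean_over_tasks(pass_at_k, results, 8)      # any of 8 runs succeeds
print(single_pass, consistency, best_of_8)
```

On this made-up data, single-pass accuracy is 0.5625, pass^8 is 0.25, and pass@8 is 0.75: only the task that succeeds on every trial counts toward pass^8, which is why consistency metrics sit below single-shot scores.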

4. METR Time Horizons (NeurIPS 2025)

METR Time Horizons paper.

50%-task-completion horizon doubles every ~7 months.
Even at this compounding rate, today’s models still cannot reliably complete 4-hour autonomous tasks.
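
The doubling arithmetic can be made concrete with a short Python sketch. It assumes a clean exponential trend with the ~7-month doubling time quoted above. The 60-minute starting horizon is a hypothetical placeholder, not a figure from the paper, and `months_until` is an illustrative helper name.

```python
from math import log2


def months_until(target_minutes: float, current_minutes: float,
                 doubling_months: float = 7.0) -> float:
    """Months until the horizon reaches the target, assuming steady doubling."""
    if target_minutes <= 0 or current_minutes <= 0:
        raise ValueError("horizons must be positive")
    if target_minutes <= current_minutes:
        return 0.0
    # Number of doublings needed, times the length of one doubling period.
    return doubling_months * log2(target_minutes / current_minutes)


# Hypothetical starting point: a 60-minute horizon today, target 4 hours.
print(months_until(target_minutes=240, current_minutes=60))
```

Under these assumptions a 4-hour horizon is two doublings (14 months) away. The metric is defined at 50% task completion, so reaching that horizon still means coin-flip success on 4-hour tasks rather than reliable completion.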

5. Beyond Accuracy / CLEAR (Nov 2025)

Beyond Accuracy CLEAR paper.

Success on enterprise tasks drops from 60% on a single run to 25% when consistency across 8 runs is required.
CLEAR framework directly applicable to managed-service SLAs.

6-10. Honorable mentions

SWE-Bench Illusion — quantifies 6-32% inflation of SWE-bench scores due to contamination. Tracking SWE-bench Verified is not sufficient for capability claims.
Why Do Multi-Agent LLM Systems Fail? — MAST failure taxonomy.
OSWorld / OSWorld-Human — desktop-agent benchmark with human-validated trajectories.
Apollo Scheming — in-context-scheming evals.
MCP Landscape / MCP Safety Audit — formal analyses of the MCP protocol that’s now category-default. See oss-agent-infra-2026.

Findings consensus across 23 papers

Single-shot scores systematically overstate production reliability. Every paper with multi-run consistency analysis (τ-bench, CLEAR, OSWorld-Human) shows substantial drops when measuring k-run consistency vs single-pass.
Tool-use is the dominant failure surface (not reasoning, not knowledge).
SWE-bench family is contaminated (6-32% inflation depending on metric).
Long-horizon execution is the bottleneck, but small single-step accuracy gains compound exponentially over long tasks as models scale.
Enterprise deployment needs governance separate from the model — see agent-tool-governance and oss-agent-infra-2026.
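
The compounding claim about long-horizon execution can be illustrated with a simplified model. This sketch assumes every step succeeds independently with the same probability and that the agent never recovers from a failed step. Both are simplifying assumptions for illustration, not results from the cited papers, and the function names are hypothetical.

```python
from math import log


def task_success(step_accuracy: float, steps: int) -> float:
    """P(all steps succeed) under the independent-steps assumption: p ** n."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    return step_accuracy ** steps


def horizon_at(step_accuracy: float, success_target: float = 0.5) -> float:
    """Task length (in steps) at which success falls to success_target."""
    if not 0.0 < step_accuracy < 1.0 or not 0.0 < success_target < 1.0:
        raise ValueError("probabilities must be strictly between 0 and 1")
    # Solve p ** n = target for n.
    return log(success_target) / log(step_accuracy)


for p in (0.99, 0.999):
    print(p, round(task_success(p, 100), 3), round(horizon_at(p), 1))
```

In this toy model, moving per-step accuracy from 99% to 99.9% raises the 50%-success horizon from roughly 69 steps to roughly 693 steps, which is the sense in which small single-step gains translate into much longer achievable tasks.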

Open research gaps

Dual-control IT-helpdesk benchmark. ITBench is close but not exactly the AI ITSM customer task.
CLEAR-conformant managed-service SLAs. A formal framework for SLA-grade k-run consistency in production.
Customer-specific contamination-controlled SWE benchmarks. Verify per-customer that the model has not seen analog data.
Trajectory-hygiene techniques for runbook executors. How to clean noisy tool traces before learning from them.
Production-grade telemetry sanitization. Removing PII from agent traces while preserving learnability.
MAST-style failure taxonomy specific to IT-ops. Adapting the multi-agent failure modes to the IT runbook domain.
In-context-scheming evals for narrowly-scoped agents. Apollo scheming generalizes; whether narrow IT agents inherit the same behaviors is unknown.

Verification discipline

Every cited card has Class A verification (arXiv abstract URL fetched in this session).
Class B/C verification is strong for all top-10 papers (institutional blogs, OpenReview, venue pages, cross-citations).
§9 (Human-AI handoff) in the source report is explicitly disclosed as the thinnest section.
The 10 papers that surfaced in searches but were not separately fetched are listed in the source report’s “cards not promoted” appendix.

Related

oss-agent-infra-2026: engineering stack + ITSM-bench proposal
agent-tool-governance
microsoft · openai · anthropic
Init Intelligence
