DeFi Daily News
Tuesday, May 19, 2026

AI Still Can’t Beat the On-Call Engineer: Here’s Why – Decrypt

by Jose Antonio Lanz
May 18, 2026
in Web 3

In brief

ARFBench is the first AI benchmark built entirely from real production incidents.
GPT-5 leads all existing AI models at 62.7% accuracy but falls short of domain experts at 72.7%.
A theoretical model-expert oracle—combining AI and human judgment—hits 87.2% accuracy, setting the ceiling for what collaborative AI-human teams could achieve.

AI companies keep pitching autonomous site reliability engineer agents—AI that investigates production incidents in place of humans. Datadog ran the actual benchmark on real outages, and the best AI models can’t yet beat the engineers they’re supposed to replace.

The benchmark is ARFBench (Anomaly Reasoning Framework Benchmark), a joint project from Datadog and Carnegie Mellon. It is built from 63 real production incidents, extracted from engineers’ own Slack threads during live emergencies, and consists of 750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points, with every question verified by hand. No synthetic data. No textbook scenarios.

“Trillions of dollars are lost each year due to system outages,” the researchers write. The benchmark tests whether AI can actually help change that.

“Despite the central role of such question-driven analysis in incident response, it remains unclear whether modern foundation models can reliably answer the kinds of time series questions engineers ask in practice,” the paper reads.



Questions come in three tiers. Tier I: Does an anomaly exist in this chart? Tier II: When did it start, how severe is it, what type?

Tier III, the hardest, requires cross-metric reasoning: Is this chart causing the problem in that other chart? That’s where AI falls apart. GPT-5 scores just 47.5% on Tier III questions when measured by F1, a metric that penalizes models for gaming answers by picking the most common class.
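To see why F1 resists that kind of gaming, here is a minimal sketch comparing plain accuracy with a macro-averaged F1 for a "model" that always picks the most common answer. The article does not say which F1 variant ARFBench uses, so macro averaging is an assumption here, and the ten answer labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of questions answered correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed variant, see lead-in)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never hit contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical answer key: class "A" is correct for 7 of 10 questions.
y_true = ["A"] * 7 + ["B"] * 2 + ["C"]
# A guesser that always answers with the most common class.
always_majority = ["A"] * 10

print(f"accuracy: {accuracy(y_true, always_majority):.3f}")  # 0.700
print(f"macro F1: {macro_f1(y_true, always_majority):.3f}")  # 0.275
```

The guesser looks respectable on accuracy but scores poorly on macro F1, because the two classes it never predicts each contribute an F1 of zero.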

How every model stacked up

GPT-5 led all existing models at 62.7% accuracy—on a test where random guessing gets 24.5%. Gemini 3 Pro scored 58.1%. Claude Opus 4.6: 54.8%. Claude Sonnet 4.5: 47.2%.

Domain experts scored 72.7% accuracy. Non-domain experts—time series researchers at Datadog without extensive observability experience—still hit 69.7%.

No AI model beat either human baseline.

Image built by Decrypt based on the ARFBench leaderboard CSV

The model that actually topped the full leaderboard was Datadog’s own hybrid: Toto—their internal time series forecasting model—combined with Qwen3-VL 32B. Toto-1.0-QA-Experimental scored 63.9% accuracy, edging past GPT-5 while using a fraction of its parameters. On anomaly identification specifically, it outperformed every other model by at least 8.8 percentage points in F1.

A purpose-built domain model, trained on observability data, outperforming a frontier general-purpose system at this specific task is the expected outcome. That’s the point.

The most valuable finding isn’t which model scored highest.

“We observe substantially different error profiles between leading models and human experts, suggesting that their strengths are complementary,” the researchers write. Models hallucinate, miss metadata, and lose domain context. Humans misread precise timestamps and occasionally fail on complex instructions. The mistakes barely overlap.

Model a theoretical “Model-Expert Oracle”—a perfect judge that always picks the right answer between the AI and the human—and you get 87.2% accuracy and 82.8% F1. Way above either alone.
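The oracle arithmetic can be sketched in a few lines: a question counts as correct if either the model or the expert got it right. The per-question results below are invented for illustration and are not ARFBench data; the function name is hypothetical too.

```python
def oracle_accuracy(model_correct, expert_correct):
    """Accuracy of a perfect judge that picks whichever answer is right.

    Each argument is a list of booleans, one per question, marking whether
    that answerer was correct. The oracle only fails when both are wrong.
    """
    if len(model_correct) != len(expert_correct):
        raise ValueError("both lists must cover the same questions")
    hits = sum(m or e for m, e in zip(model_correct, expert_correct))
    return hits / len(model_correct)


# Hypothetical results on 10 questions (True = answered correctly).
model = [True, True, True, False, True, False, True, True, False, False]
expert = [True, False, True, True, True, True, False, True, True, False]

print(sum(model) / len(model))         # 0.6
print(sum(expert) / len(expert))       # 0.7
print(oracle_accuracy(model, expert))  # 0.9
```

In this toy case the two answerers miss mostly different questions, so the oracle lands well above either one. The more the mistakes overlap, the closer the oracle falls to the better individual score.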

That’s not a product. It’s a documented target—built from real emergencies, not curated datasets—that quantifies exactly how much better human-AI collaboration could perform. The leaderboard is live on Hugging Face. GPT-5 sits at 62.7%. The ceiling is 87.2%.

Copyright © 2024 Defi Daily.
Defi Daily is not responsible for the content of external sites.