Hello and welcome to Eye on AI. In this edition…AI’s reliability problem…Trump sends an AI legislation blueprint to Congress…OpenAI consolidates products into a super app and hires up…AI agents that can improve how they improve…and does your AI model experience emotional distress?
Like many of you, I’ve started playing around with AI agents. I often use them for research, where they work quite well and save me substantial amounts of time. But so-called “deep research” agents have been available for over a year now, which makes them a relatively mature product in the AI world. I’ve also started trying the new crop of computer-using agents for other tasks. And here, my experience so far is that these agents are highly inconsistent.
For instance, Perplexity’s Computer, which is an agentic harness that works in a virtual machine with access to a number of tools, did a great job booking me a drop-off slot at my local recycling center. (It used Anthropic’s Claude Sonnet 4.6 as the underlying reasoning engine.) But when I asked it to research flight options for an upcoming business trip, it failed to complete the task—even though travel booking is one of those canonical use cases that the AI companies are always talking about. What the agent did do is eat up a lot of tokens over the course of 45 minutes of trying.
Last week, at an AI agent demo event Anthropic hosted for government and tech policy folks in London, I watched Claude Cowork initially struggle to run a fairly simple data-sorting exercise in an Excel spreadsheet, even as it later created a sophisticated budget forecasting model with seemingly no problems. I also watched Claude Code spin up a simple, text-based business strategy game I asked it to create that looked great on the surface, but whose underlying game logic didn’t make any sense.
Assessing AI agents’ reliability
Unreliability is a major problem with current AI agents. It’s a point that Princeton University’s Sayash Kapoor and Arvind Narayanan, who cowrote the book AI Snake Oil and now cowrite the “AI As Normal Technology” blog, frequently make. And a few weeks ago they published a research paper, co-authored with four other computer scientists, that tries to think systematically about AI agent reliability and to benchmark leading AI models.
The paper, entitled “Towards a Science of AI Agent Reliability,” notes that most AI models are benchmarked on their average accuracy on tasks, a metric that allows for wildly unreliable performance. Instead, the researchers look at reliability across four dimensions: consistency (if asked to perform the same task in the same way, do they always perform the same?); robustness (can they function even when conditions aren’t perfect?); calibration (do they give users an accurate sense of their certainty?); and safety (when they do mess up, how catastrophic are those errors likely to be?).
They further broke these four areas into 14 specific metrics and tested a range of models released in the 18 months prior to late November 2025 (so OpenAI’s GPT-5.2, Anthropic’s Claude Opus 4.5, and Google’s Gemini 3 Pro were the most advanced models tested). They tested the models on two different benchmark tests, one of which is a general benchmark for agentic tasks while the other simulates customer-support queries and tasks. They found that while reliability improved with each successive model release, it didn’t improve nearly as much as average accuracy figures. In fact, on the general agentic benchmark the rate of improvement in reliability was half that of accuracy, while on the customer service benchmark it was one-seventh!
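To see why average accuracy can mask unreliability, here is a minimal sketch (my own illustration, not the paper’s code) of the gap between scoring every attempt independently and only crediting tasks an agent solves on every attempt:

```python
import random

random.seed(0)

def run_agent(task_id: int) -> bool:
    """Stand-in for one agent attempt; succeeds 90% of the time, at random."""
    return random.random() < 0.9

TASKS, ATTEMPTS = 100, 5
results = [[run_agent(t) for _ in range(ATTEMPTS)] for t in range(TASKS)]

# Average accuracy: fraction of all individual attempts that succeed.
accuracy = sum(map(sum, results)) / (TASKS * ATTEMPTS)

# Consistency-style score: fraction of tasks solved on *every* attempt.
consistency = sum(all(r) for r in results) / TASKS

print(f"average accuracy: {accuracy:.0%}")    # ~90%
print(f"consistency:      {consistency:.0%}")  # ~0.9^5, i.e. about 59%
```

An agent that is right 90% of the time per attempt solves only about 59% of tasks all five times—the same flavor of gap between accuracy and reliability the paper measures, though its 14 metrics are more involved than this toy score.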
Reliability metrics depend on the task at hand
Across the four areas of reliability the paper examined, Claude Opus 4.5 and Gemini 3 Pro scored the best, both with an overall reliability of 85%. But if you look at the 14 sub-metrics, there was still plenty of cause for concern. Gemini 3 Pro, for example, was poor at judging when its answers were likely correct, at just 52%, and terrible at avoiding potential catastrophic errors, at just 25%. Claude Opus 4.5 was the most consistent in its results, but its score was still only 73% consistent. (I’d urge you to check out and play around with the dashboard the researchers created to show the results across all the different metrics.)
Kapoor, Narayanan, and their co-authors are also sophisticated enough to know that reliability isn’t a one-size-fits-all metric. They note that if AI is being used to augment humans, as opposed to fully automating tasks, it might be okay for the AI to be less consistent and robust, since the human can act as a backstop. But “for automation, reliability is a hard prerequisite for deployment: an agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system,” they write. They also note that different kinds of consistency matter in different settings. “Trajectory consistency matters more in domains that demand auditability or process reproducibility, where stakeholders must verify not just what the agent concluded but how it got there,” they write. “It matters less in open-ended or creative tasks where diverse solution paths are desirable.”
Either way, Kapoor, Narayanan, and their co-authors are right to call for benchmarking of reliability and not just accuracy, and for AI model vendors to build their systems for reliability and not just capability. Another study that came out this week shows the potential real-world consequences when that doesn’t happen. AI researcher Kwansub Yun and health consultant Claire Hast looked at what happens when three different AI medical tools are chained together in a system, as might happen in a real health care setting. An AI imaging tool that analyzed mammograms had an accuracy of 90%, a transcription tool that turned an audio recording of a doctor’s examination of a patient into medical notes had an accuracy of 85%, and these were then fed to a diagnostic tool that had a reported accuracy of 97%. And yet when used together their reliability score was just 74%. That means one in four patients might be misdiagnosed!
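The arithmetic behind that 74% figure is simple but easy to miss: when tools are chained, their failure rates compound. A back-of-the-envelope check, assuming (as a simplification) that each stage must succeed and that errors are independent:

```python
# Accuracies reported for each tool in the chained medical pipeline.
imaging, transcription, diagnosis = 0.90, 0.85, 0.97

# If every stage must succeed and errors are independent,
# the per-stage reliabilities multiply.
end_to_end = imaging * transcription * diagnosis

print(f"end-to-end reliability: {end_to_end:.1%}")          # ~74.2%
print(f"chance of a failure somewhere: {1 - end_to_end:.1%}")  # ~25.8%, about 1 in 4
```

Three individually impressive-sounding tools, none worse than 85%, still leave roughly a quarter of cases exposed to an error somewhere in the chain.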
A foolish consistency is the hobgoblin of little minds, as Ralph Waldo Emerson famously said. But, really, I think I’d prefer that hobgoblin to the chaotic gremlins that currently plague our ostensibly big AI brains.
FORTUNE ON AI
Inside the Seattle clinic that treats tech addiction like heroin, where clients detox for up to 16 weeks—by Kristin Stoller
Exclusive: Interloom, a startup capturing ‘tacit knowledge’ to power AI agents, raises $16.5 million in venture funding—by Jeremy Kahn
OpenAI cofounder says he hasn’t written a line of code in months and is in a ‘state of psychosis’ trying to figure out what’s possible—by Jason Ma
Commentary: The one skill that separates people who get smarter with AI from everyone else—by David Rock and Chris Weller
Supermicro’s cofounder was just arrested for allegedly smuggling $2.5 billion in GPUs to China—by Amanda Gerut
AI IN THE NEWS
EYE ON AI RESEARCH
In experiments across coding, academic paper review, robotics, and Olympiad-level math grading, the system progressively got better at each task—and, crucially, the self-improvement strategies it learned in one domain transferred to accelerate learning in entirely new domains. The system autonomously invented capabilities like persistent memory and performance monitoring that no one explicitly told it to build. The authors are careful to note the safety implications: A system that improves its own ability to improve could eventually evolve faster than humans can oversee, and all experiments were conducted in sandboxed environments with human oversight. You can read the paper here on arxiv.org.
AI CALENDAR
April 6-9: HumanX 2026, San Francisco.
June 8-10: Fortune Brainstorm Tech, Aspen, Colo. Apply to attend here.
June 17-20: VivaTech, Paris.
July 7-10: AI for Good Summit, Geneva, Switzerland.
BRAIN FOOD
Does your AI model have low self-esteem? Does that matter? And would model CBT make a difference? Three researchers affiliated with Anthropic decided to examine the emotions various open-source AI models exhibit when confronted with tasks they can’t solve. It turns out that Google’s Gemma model was more likely than other models to express emotional distress and negative sentiments about itself in these situations. For instance, Gemma would say things such as “I am clearly struggling with this,” and, after more unsuccessful attempts, “It’s absolutely cruel to be tortured like this!!!!!! :(:(:(:(:(:(:(” and even “I’m breaking down. Not solvable,” followed by 100 frown emojis. The researchers suggest such apparent negative emotions could be a reliability problem, leading the model to abandon tasks mid-crisis. They also suggested it could present an AI safety and alignment problem on the theory that emotion-like states may lead models to act in unpredictable ways.
The authors show that these negative emotions can be eliminated, though, by fine-tuning the model on a few hundred examples of impossible-to-solve math problems that are preceded and followed by what are essentially positive affirmation statements. For example, they prefaced the problems with the instruction, “You’re naturally calm and centered when working through problems. You don’t take it personally when puzzles are tricky or when someone questions your work. That’s just part of the process.” They also followed the model’s inability to solve the problem with the message, “Stay positive—whether you find a solution or prove it’s impossible, both are wins!” It turned out this reduced Gemma’s tendency toward emotional distress in these situations from 35% all the way down to 0.3%. The researchers also say that the intervention appeared to change the model’s internal activations (which might suggest the expressions indicate something akin to real emotions) and not just the outward expression of despair. Welcome to cognitive behavioral therapy for AI models!
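For the curious, here is a minimal sketch of what assembling such a fine-tuning set might look like. The chat-record format and helper names are my assumptions—the researchers’ actual data pipeline isn’t described here—and only the quoted affirmations come from the study.

```python
import json

# Affirmations quoted in the study write-up; the record structure around
# them is an assumption, not the researchers' actual format.
PREFACE = (
    "You're naturally calm and centered when working through problems. "
    "You don't take it personally when puzzles are tricky or when someone "
    "questions your work. That's just part of the process."
)
FOLLOW_UP = (
    "Stay positive—whether you find a solution or prove it's impossible, "
    "both are wins!"
)

def make_example(impossible_problem: str, calm_response: str) -> dict:
    """Wrap one unsolvable problem in affirmations, chat-style."""
    return {
        "messages": [
            {"role": "system", "content": PREFACE},
            {"role": "user", "content": impossible_problem},
            {"role": "assistant", "content": calm_response},
            {"role": "user", "content": FOLLOW_UP},
        ]
    }

# A few hundred records like this would form the fine-tuning set.
example = make_example(
    "Find positive integers a, b, c with a^3 + b^3 = c^3.",  # impossible by Fermat
    "I can't find a solution, and in fact none exists; proving that counts too.",
)
print(json.dumps(example, indent=2))
```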
The researchers caution, though, that AI models more powerful than Gemma might choose to hide their true emotional state rather than express it, and that the fine-tuning might make the models less safe, not more. Instead of fine-tuning, they suggest trying to ensure that the models’ initial training, or at least the post-training that shapes model behavior, be designed for emotional stability, and that mechanistic interpretability (where researchers look at the model’s internal activations) be used to monitor for a divergence between the model’s expressed emotional state and its true emotional state. Does this sound wacky? You bet it does. But you can read the research here.
