Andrej Karpathy on Why Agent Skills Fail

Andrej Karpathy’s analysis reveals a significant limitation in relying on agent skills for AI-driven workflows: their struggle to maintain accuracy in complex, multi-step tasks. As outlined by The AI Automators, these skills often rely on probabilistic models, which can result in skipped steps or hallucinated outputs. For instance, in areas like regulatory compliance or medical diagnostics, even small errors can lead to serious consequences, highlighting the need for more reliable approaches. Deterministic harness engineering offers a structured alternative, using mechanisms like validation loops and state tracking to ensure outputs remain precise and dependable.

This overview explains how harness engineering addresses these challenges through specific features such as context isolation and error correction. It covers real-world examples, including Stripe’s “minions” and Anthropic’s plugins, which show harnesses managing intricate workflows in practice. It also considers how these methods can be adapted for tasks like financial audits or contract reviews, with strategies for designing AI systems that can meet high-stakes demands.

The Reliability Problem in AI

TL;DR Key Takeaways:

High reliability in AI systems is critical for managing complex workflows, but current methodologies like agent skills often fail due to issues like hallucinations, skipped steps and inconsistent outputs.
Agent skills rely on probabilistic models, which lack the precision and reliability needed for high-stakes applications such as financial audits, medical diagnoses and regulatory compliance.
Deterministic harness engineering offers a structured alternative, using frameworks to validate and gate outputs at each stage so that errors are corrected before they propagate.
Key features of harness engineering include state tracking, sub-agent delegation, parallel processing, context isolation, validation loops and cost optimization, which together support precision, consistency and scalability.
Harness engineering is evolving with new architectures and is poised to play a central role in enabling reliable, scalable AI systems for enterprise applications and critical tasks.

For AI systems to handle critical business operations effectively, they must achieve a level of reliability comparable to traditional software systems. However, this is far from straightforward. Multi-step workflows inherently increase the risk of failure at each stage, creating a cascading effect where small errors can snowball into significant system failures. Even minor inconsistencies can render an AI system unsuitable for tasks that demand precision, such as financial audits, medical diagnoses, or legal contract reviews.
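The cascading effect can be made concrete with a little arithmetic. The sketch below assumes every step succeeds independently with the same probability, which is a simplification introduced here for illustration and not a claim from the source. Under that assumption, a workflow succeeds only if all of its steps do, so the end-to-end rate is the per-step rate raised to the number of steps.

```python
def workflow_success_rate(step_success: float, steps: int) -> float:
    """Probability that every step succeeds.

    Assumes steps fail independently and share one success rate,
    which is an illustrative simplification.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


if __name__ == "__main__":
    # A step that is right 95% of the time looks dependable on its own,
    # but the end-to-end rate drops quickly as steps are chained.
    for steps in (1, 5, 10, 20):
        rate = workflow_success_rate(0.95, steps)
        print(f"{steps:>2} steps: {rate:.3f}")
```

With a 95% per-step rate, a ten-step workflow completes without error only about 60% of the time under these assumptions.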

Businesses today demand AI systems that are not only innovative but also dependable and consistent. Without these qualities, the promise of AI-driven automation remains unfulfilled. The challenge lies in designing systems that consistently deliver accurate results across diverse and complex scenarios, so that they meet the high standards required for real-world applications.

Agent Skills: Why They’re Not Enough

Agent skills, often implemented as pre-defined prompts or task-specific capabilities, have gained popularity as a method for building AI systems. While they offer flexibility and adaptability, they are inherently flawed in several critical ways. Common issues include hallucinations, skipped steps and inconsistent outputs. These problems become particularly pronounced in large-scale, autonomous operations, where even a single error can disrupt the entire workflow.

The root cause of these shortcomings lies in the reliance of agent skills on probabilistic models, which lack the precision and reliability required for high-stakes applications. For example, in scenarios like regulatory compliance or medical decision-making, even a minor error can have significant consequences. This lack of robustness highlights the need for a more structured and deterministic approach to AI system design.


Harness Engineering: A Structured Alternative

Deterministic harness engineering offers a promising solution to the limitations of agent skills. A harness acts as a structured framework that validates and gates outputs at each stage of a workflow, so that errors are identified and corrected before they propagate further. Because the process is embedded in the system itself, not left to the model to follow from a prompt, harnesses enhance reliability and reduce the risk of failure.
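A gate of this kind can be sketched in a few lines. The example below is a minimal illustration, not the design used by any system named in this article: `run_gated`, `GateError` and the stage tuples are hypothetical names, and the step functions are plain Python callables standing in for model calls. Each stage's output must pass its validator before the next stage runs, and a stage that keeps failing stops the workflow.

```python
from typing import Callable, List, Tuple

# (stage name, step function, validator) -- all hypothetical for this sketch.
Stage = Tuple[str, Callable[[str], str], Callable[[str], bool]]


class GateError(RuntimeError):
    """Raised when a stage never produces output that passes its gate."""


def run_gated(stages: List[Stage], initial: str, max_attempts: int = 3) -> str:
    current = initial
    for name, step, is_valid in stages:
        for _ in range(max_attempts):
            candidate = step(current)
            if is_valid(candidate):
                current = candidate  # gate passed, move to the next stage
                break
        else:
            # No attempt passed: stop here so the bad output cannot propagate.
            raise GateError(
                f"stage {name!r} failed validation after {max_attempts} attempts"
            )
    return current


if __name__ == "__main__":
    stages: List[Stage] = [
        ("normalize", lambda s: s.strip().upper(), lambda out: len(out) > 0),
        ("tag", lambda s: f"[REVIEWED] {s}", lambda out: out.startswith("[REVIEWED] ")),
    ]
    print(run_gated(stages, "  clause 4.2  "))
```

The retry count and the validators are where a real harness would encode domain rules; the control flow itself stays deterministic even when the steps are not.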

Real-world implementations of harness engineering, such as Stripe’s “minions” and Anthropic’s plugins, demonstrate the effectiveness of this approach. These systems use deterministic processes to manage workflows, providing greater control and predictability. Harnesses are particularly valuable in scenarios where precision and consistency are critical, such as contract reviews, data analysis, or financial reporting.

How Harnesses Are Designed

Harnesses are tailored to meet the specific needs of different workflows, incorporating features that ensure efficiency and reliability. For instance, a harness designed for contract review might include the following components:

State Tracking: Monitors the progress of each task so that no steps are missed or repeated.
Sub-Agent Delegation: Assigns isolated tasks to specialized sub-agents, preventing context pollution and improving accuracy.
Parallel Processing: Executes multiple tasks simultaneously, reducing overall processing time and enhancing efficiency.
Context Isolation: Maintains separate contexts for different tasks, preventing interference and keeping outputs clear.

These features work in tandem to ensure the system remains focused and efficient, even when managing complex, multi-step workflows.
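The four components listed above can be combined in a small sketch. Everything here is an assumption made for illustration: `TaskState`, `run_sub_agents` and the `agent` callable are hypothetical, and the sub-agent is a stub function where a real harness would call a model. Each sub-agent receives only its own clause text (context isolation), tasks run on a thread pool (parallel processing), and a status map records what has finished so completed work is not repeated (state tracking).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TaskState:
    # task name -> "pending" | "done" | "failed"
    status: Dict[str, str] = field(default_factory=dict)
    results: Dict[str, str] = field(default_factory=dict)


def run_sub_agents(
    tasks: Dict[str, str],
    agent: Callable[[str], str],
    state: TaskState,
    max_workers: int = 4,
) -> TaskState:
    # Skip anything already marked done so reruns do not repeat work.
    pending = {n: ctx for n, ctx in tasks.items() if state.status.get(n) != "done"}
    for name in pending:
        state.status[name] = "pending"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each sub-agent call sees only its own context string.
        futures = {n: pool.submit(agent, ctx) for n, ctx in pending.items()}
        # State is updated on the calling thread only, so no locking is needed.
        for name, future in futures.items():
            try:
                state.results[name] = future.result()
                state.status[name] = "done"
            except Exception:
                state.status[name] = "failed"
    return state


if __name__ == "__main__":
    clauses = {"indemnity": "party a shall indemnify", "termination": "either party may"}
    final = run_sub_agents(clauses, lambda text: f"{len(text.split())} words", TaskState())
    print(final.status)
    print(final.results)
```

Failed tasks stay in the status map as "failed", so a later call with the same state retries only those.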

Key Features of Harness Engineering

To deliver optimal performance, harnesses incorporate several critical features that address the challenges of reliability and scalability:

Planning: Fixed or dynamic plans guide workflows so that tasks are executed in the correct sequence and with the necessary precision.
File Systems: Virtual or physical file systems provide reliable mechanisms for data storage and retrieval, keeping data consistent across tasks.
Validation Loops: Iterative checks identify and correct errors at each stage, improving the overall quality of outputs.
Memory Management: Combines short-term and long-term memory to retain context, enabling better decision-making and reducing redundancy.
Cost Optimization: Allocates resources efficiently by using simpler models for routine tasks and advanced models for complex orchestration.

These features collectively enable harnesses to deliver the precision, consistency and scalability required for enterprise applications.
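Planning and cost optimization in particular lend themselves to a short sketch. The example below assumes a fixed plan whose steps carry a complexity score assigned by the harness author, and it routes each step to one of two model tiers. The model names, the 1 to 5 scale and the threshold are all placeholders invented for this illustration, not values from the source or from any vendor.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Placeholder tier names; substitute whatever models a deployment actually uses.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"


@dataclass(frozen=True)
class PlanStep:
    name: str
    complexity: int  # 1 (routine) to 5 (complex), set by the harness author


def route(step: PlanStep, threshold: int = 3) -> str:
    """Send routine steps to the cheap tier and complex ones to the strong tier."""
    return STRONG_MODEL if step.complexity >= threshold else CHEAP_MODEL


def assign_models(plan: List[PlanStep], threshold: int = 3) -> List[Tuple[str, str]]:
    # The plan is a fixed, ordered list, so execution order is deterministic.
    return [(step.name, route(step, threshold)) for step in plan]


if __name__ == "__main__":
    plan = [
        PlanStep("extract_dates", 1),
        PlanStep("classify_clause", 2),
        PlanStep("assess_liability", 5),
    ]
    for name, model in assign_models(plan):
        print(f"{name} -> {model}")
```

Keeping the routing rule outside the model means the cost profile of a workflow can be read from the plan before anything runs.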

Applications and Benefits

Harness engineering enables AI systems to reliably execute long-running, complex tasks. By addressing challenges like context rot and improving observability, harnesses ensure consistent outputs even in dynamic environments. Their modular design and support for parallel processing enhance scalability and efficiency, making them particularly well-suited for enterprise use cases.

For example, a harness-based system could manage a multi-step financial audit by delegating specific tasks to sub-agents, validating each output and maintaining a clear record of progress. This structured approach minimizes errors, ensures compliance with regulatory standards and delivers results that meet the required levels of accuracy and reliability.
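One way to keep that record of progress for a long-running task is to checkpoint completed steps to a file, so an interrupted run can resume where it stopped. This is a minimal sketch under stated assumptions: `run_with_checkpoint` and the step names are hypothetical, the step function stands in for a validated sub-agent call, and a single JSON file plays the role of the harness file system.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_with_checkpoint(
    steps: List[str],
    do_step: Callable[[str], str],
    checkpoint: Path,
) -> Dict[str, str]:
    # Load prior progress if a checkpoint exists, otherwise start fresh.
    completed: Dict[str, str] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    for name in steps:
        if name in completed:
            continue  # finished in an earlier run, do not repeat
        completed[name] = do_step(name)
        # Persist after every step so a crash loses at most the step in flight.
        checkpoint.write_text(json.dumps(completed, indent=2))
    return completed


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "audit_progress.json"
        audit_steps = ["ingest_ledger", "sample_transactions", "reconcile_totals"]
        print(run_with_checkpoint(audit_steps, lambda s: f"{s}: ok", path))
```

The checkpoint file doubles as an audit trail: it shows which steps ran and what each one returned.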

The Road Ahead for Harness Engineering

Harness engineering is an evolving discipline with significant potential for innovation. Emerging architectures, such as hierarchical, multi-agent and graph-based designs, are opening new avenues for improving system performance and scalability. Future research is likely to focus on refining key components like validation loops, memory systems and state management to further enhance reliability and efficiency.

As the field continues to advance, harnesses are poised to play a central role in enabling AI systems to handle real-world applications. By moving beyond the limitations of agent skills, businesses can achieve dependable, scalable automation that meets the rigorous demands of modern industries.

Media Credit: The AI Automators
