Reasoning Models Hallucinate More — Spelling Trouble for AI Agent Adoption

Reasoning AI models unlock powerful autonomy for agents, but the risk of hallucinations looms large. Explore how to ensure greater accuracy and reliability with RAG for agents.

Earlier this month, The New York Times spotlighted a counterintuitive finding: reasoning models like OpenAI’s o3, o4-mini, or DeepSeek-R1 are significantly more prone to hallucinations than their base model counterparts, such as GPT-4o or DeepSeek-V3. This is a dire warning for agent enthusiasts and adopters: reasoning LLMs are better suited for powering autonomous agents than their chat-oriented predecessors, but the risk of hallucinations in agentic contexts is much higher than in chat-based RAG apps.

Agentic AI doesn’t just answer questions; it makes decisions, triggers APIs, files tickets, and moves money. A stray fabrication that might merely embarrass a customer-support bot can cascade into a multi-step automation gone catastrophically wrong.

The upshot is clear: if we want agents to operate in any consequential (and therefore valuable) environment, their reasoning engines must be strapped to a Retrieval-Augmented Generation (RAG) knowledge & trust layer.


From Chatbots to Agents—the Risk Curve Goes Vertical

AI agents are quickly gaining adoption, with the number of agent pilots doubling from Q4 ’24 to Q1 ’25 and 99% of organizations planning to deploy agents (KPMG study). This makes a lot of sense – AI agents unlock the ROI of generative AI, changing the paradigm from helping you do work to doing work for you via automated, semi-independent or fully independent action. The massive promise of this tech is mirrored by a massive risk curve.

At the core of GenAI systems, including agents, are large language models (LLMs). LLMs are not search engines, despite how they’re often used; they have a tendency to confidently state entirely made-up claims. This behavior is called hallucination. A conversational bot that answers a customer with a made-up statistic is embarrassing and potentially costly. An autonomous agent that spins that hallucinated statistic into a downstream SQL query, triggers a workflow, and files a compliance report is catastrophic.

The very properties that make agentic systems valuable—long-horizon planning, tool use, and self-directed action—also amplify every upstream factual error. Reasoning models are the natural ‘brain’ for AI agents, but as the new benchmarks show, they are also the most prone to making things up.


Why Reasoning Models Are Ideal for Agentic AI

Reasoning-optimized large language models (LLMs) do more than predict the next token. They unpack a goal into a sequence of decisions, select the right tools, and adjust mid-flight when reality contradicts expectations. To perform these steps properly, AI agents need:

  1. Long-horizon planning: Reasoning models provide chain-of-thought generation, breaking complex tasks into discrete, ordered steps the agent can execute.

  2. Dynamic tool selection: Reasoning models provide function-calling interfaces that let the model inspect available APIs and decide, at run time, which one best advances its plan.

  3. Context-aware adaptation: Reasoning models provide built-in reflection loops that allow the model to critique interim results (“Did that SQL query return the fields I need?”) and pivot if the answer is no.

  4. Error recovery: When a sub-step fails, reasoning models can reason backward, identify why, and propose alternatives rather than aborting the whole job.

In short, reasoning models give an agent a cognitive whiteboard: they don’t just say things—they decide what to do next, why it matters, and whether the outcome is good enough to continue. That’s the essence of agency. The trade-off, as highlighted in the NYT article, is a higher propensity to hallucinate. It’s as if the top strategist at your firm were also a prolific, pathological liar. Core reasoning competencies are exactly what’s needed to make good on the promise of AI agents, but the drawback of “making stuff up” is simply unacceptable.
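To make those competencies concrete, here is a minimal, illustrative sketch of a plan → act → reflect agent loop in Python. The `call_llm` helper and the tool registry are hypothetical placeholders, not any vendor's SDK; a real agent framework would add structured function calling, retries, and guardrails.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a reasoning-optimized model (hypothetical)."""
    raise NotImplementedError

# Hypothetical tool registry; a real agent would wire these to live APIs.
TOOLS: dict[str, Callable[[str], str]] = {
    "run_sql": lambda query: "rows...",
    "file_ticket": lambda body: "TKT-0001",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    # 1. Long-horizon planning: ask the model to decompose the goal.
    plan = call_llm(f"Break this goal into ordered steps: {goal}")
    history = [f"PLAN:\n{plan}"]
    for _ in range(max_steps):
        # 2. Dynamic tool selection: the model picks the next tool and its input.
        decision = call_llm(
            "Given the plan and history, reply as `tool_name | tool_input` "
            "or `DONE | final answer`.\n" + "\n".join(history)
        )
        name, _, arg = (part.strip() for part in decision.partition("|"))
        if name == "DONE":
            return arg
        result = TOOLS[name](arg) if name in TOOLS else f"Unknown tool: {name}"
        # 3. Context-aware adaptation / 4. error recovery: critique and pivot.
        critique = call_llm(
            f"The step returned: {result}. Does this advance the plan? "
            "If not, explain what to try instead."
        )
        history += [f"ACTION: {decision}", f"RESULT: {result}", f"REFLECTION: {critique}"]
    return "Stopped: step budget exhausted."
```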


RAG to the Rescue

Retrieval-Augmented Generation (RAG) inserts a retrieval step between the user (or agent) prompt and the generation phase, forcing the model to ground its answers in external, authoritative sources. Thanks to another property of LLMs, in-context learning, the model biases its output toward any context supplied in the prompt, so retrieving the correct answer and handing it to the model can nearly eliminate hallucinations.
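As a rough illustration of the pattern, the sketch below injects retrieved passages into the prompt before generation. The `retrieve` and `call_llm` helpers are placeholders for whatever search index and model API an organization already uses; this shows the flow, not a production implementation.

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k passages from an enterprise index (hypothetical)."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder for a model API call (hypothetical)."""
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    passages = retrieve(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # In-context learning: the model biases its answer toward the supplied sources.
    prompt = (
        "Answer using ONLY the sources below and cite them like [1]. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {question}"
    )
    return call_llm(prompt)
```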

OpenAI reports that GPT-4o with live web search, effectively a lightweight RAG system, jumps to 90% accuracy on the SimpleQA benchmark. This makes intuitive sense: retrieval turns every question into an open-book exam. At Pryon, we have seen accuracy as high as 99% on client content when their GenAI systems are hooked up to a resilient, performant RAG system connected to great, trustworthy content.

Integrating a RAG layer that grounds the planning, reasoning, conversation, and actions of your AI agents lets builders use these reasoning models without having to worry about hallucinations.


Why “Agent-Grade” RAG Is Hard

Grounding a single answer is table stakes and, if the use case is tightly scoped, can often be handled effectively with a variety of open-source tools. An agent, however, needs to ground hundreds of micro-decisions per task run while coping with:

1. Adaptive Context Windows: The agent’s information need evolves every step.

  • Upgrade to RAG stack: Dynamic query rewriting + on-the-fly re-ranking

2. Tool Chains & Code Execution: Generated code must target real APIs, not hallucinated ones.

  • Upgrade to RAG stack: API schema retrieval + function-call validation

3. Latency Constraints: Agents loop; slow retrieval kills throughput.

  • Upgrade to RAG stack: Hierarchical caching & local vector stores

4. Scale: Org-wide agents hit millions of retrieval calls/day.

  • Upgrade to RAG stack: Sharded indexes, hybrid search

5. Autonomy Without Human Gut-Checks: Human-in-the-loop is unavailable mid-loop.

  • Upgrade to RAG stack: Confidence scoring, citation voting, auto-abstain policies (a minimal sketch of an auto-abstain check follows this list)

6. Audit & Compliance: Every action must be explainable.

  • Upgrade to RAG stack: Confidence scoring, immutable provenance logs & signed citations
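As a rough illustration of item 5, here is a minimal confidence-gated retrieval check with an auto-abstain policy. The `retrieve_with_scores` helper, the score scale, and the thresholds are assumptions chosen for the sketch, not a specific product API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    score: float   # retriever confidence, assumed normalized to [0, 1]
    source: str

def retrieve_with_scores(query: str, k: int = 5) -> list[Passage]:
    """Placeholder: return scored passages from the index (hypothetical)."""
    raise NotImplementedError

def grounded_or_abstain(query: str, min_score: float = 0.7, min_support: int = 2) -> dict:
    evidence = [p for p in retrieve_with_scores(query) if p.score >= min_score]
    if len(evidence) < min_support:
        # Auto-abstain: with no human in the loop, the safe default is to
        # refuse this step and escalate rather than let the model guess.
        return {"action": "abstain", "reason": "insufficient high-confidence evidence"}
    return {"action": "proceed", "evidence": [(p.source, round(p.score, 2)) for p in evidence]}
```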


Designing a Next-Gen RAG Pipeline for Agents

Given the power of RAG for agents to mitigate hallucinations, building a RAG pipeline specifically for agentic AI — and doing it right — is vital. Here’s what organizations building this next-gen pipeline need to consider:

  1. Multimodal Ingestion. OCR, Handwritten Text Recognition, Table Structure Recognition, layout analysis, and speech transcription ensure the agent can grab any enterprise artifact—from scanned contracts to call recordings.
  2. Hybrid Retrieval. Combine sparse keyword, dense vector, structured search, web search, and metadata filtering, adaptively, at scale. The agent selects the strategy at runtime based on query type, maximizing recall without flooding the context window (and without driving up your token costs).
  3. Plan-Aware Retrieval. Expose the agent’s intermediate plan to the retriever so it can pre-fetch documents for upcoming steps, hiding latency and driving better planning and reasoning with retrieval augmentation.
  4. Self-Verification Loop. After draft generation, the agent re-queries for counterevidence. If contradictions exceed a threshold, it reruns or escalates.
  5. Citations as First-Class Data. Each retrieved chunk carries a hash and security label; the agent’s action executor refuses to proceed unless every critical field is backed by a high-confidence citation (see the sketch after this list).
  6. Streaming Index Updates. For autonomous agents running 24/7, the RAG index must ingest fresh records continuously. Staleness is a hidden vector for hallucinations.
  7. Governance Hooks. Provide real-time telemetry on retrieval hit rates, citation density, and hallucination flags so ops teams can halt or throttle misbehaving agents.
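To illustrate the “citations as first-class data” idea from item 5, here is a sketch of a citation-gated action executor. The `Citation` type, field names, and confidence threshold are assumptions chosen for the example; a production executor would also enforce security labels and write to an immutable provenance log.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Citation:
    chunk_text: str
    source_uri: str
    confidence: float
    security_label: str = "internal"

    @property
    def chunk_hash(self) -> str:
        # Hash the retrieved chunk so a provenance log can prove what the agent saw.
        return hashlib.sha256(self.chunk_text.encode("utf-8")).hexdigest()

def execute_action(action: dict, citations: dict[str, Citation], min_conf: float = 0.8) -> dict:
    critical_fields = action.get("critical_fields", [])
    unverified = [
        field for field in critical_fields
        if field not in citations or citations[field].confidence < min_conf
    ]
    if unverified:
        # Refuse to proceed: every critical field must carry a high-confidence citation.
        return {"status": "blocked", "unverified_fields": unverified}
    provenance = {f: (c.source_uri, c.chunk_hash) for f, c in citations.items()}
    # ... the real downstream API call would go here ...
    return {"status": "executed", "provenance": provenance}
```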


Retrieval Is a Competitive Advantage

AI agents promise a massive transformation in the way we work. The ability for organizations to access the intelligence of these models in a way that actually makes sense for their business is critical.  

The effective deployment of AI agents will be a competitive differentiator for organizations. Concerningly, the reasoning models best positioned to drive the intelligence of these agents are also the most prone to the hallucinations that make them unusable in any valuable context.

At Pryon, we firmly believe that organizations can, and should, have their cake and eat it too — use the highest-end, state-of-the-art models designed for agentic AI without having to worry about hallucinations. But this depends on getting retrieval right, which we believe is the top strategic imperative for organizations in 2025.
