Mitigating LLM Hallucinations with Fine-Grained Attribution

How Pryon’s in-house LLM combines fine-grained attribution with RAG to deliver more trustworthy generative responses

Authors

Kasturi Bhattacharjee, Steven Rennie, and Vaibhava Goel are with Pryon Research.

Large language models (LLMs) have recently demonstrated remarkable performance on a wide range of natural language processing (NLP) tasks and beyond. This has led to their ubiquitous use in many areas, including chatbots, virtual assistants, content generators, coding assistants, language translators, and more.

While exhibiting a tremendous ability to generate content, interpret instructions, and reason, LLMs tend to hallucinate. That is, they can generate content that is inconsistent with real-world facts or inputs [1-4]. This raises questions about the reliability and trustworthiness of LLMs in real-world settings, and has been the subject of much debate in both academic circles and the mainstream media.

Retrieval-augmented generation (RAG)

Retrieval-augmented generation (RAG) has emerged as an important way to mitigate hallucinations [6-10]. With RAG, information from relevant sources is first retrieved before an AI application generates a response. By conditioning their responses on the information sources, RAG-based applications can generate more accurate and reliable answers. 
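To make the pipeline concrete, here is a minimal sketch of a RAG loop. The toy lexical retriever and the llm.generate call are illustrative stand-ins, not Pryon's implementation:

    # A minimal RAG sketch: retrieve relevant passages, then condition
    # generation on them. The retriever and `llm` are hypothetical.

    def retrieve(question, corpus, k=3):
        # Toy retriever: rank passages by word overlap with the question.
        q_terms = set(question.lower().split())
        return sorted(corpus,
                      key=lambda p: len(q_terms & set(p.lower().split())),
                      reverse=True)[:k]

    def build_prompt(question, passages):
        # Number the passages and condition the answer on them.
        numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
        return ("Answer the question using only the passages below.\n\n"
                f"{numbered}\n\nQuestion: {question}\nAnswer:")

    # response = llm.generate(build_prompt(question, retrieve(question, corpus)))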

However, RAG alone is not enough to eliminate hallucinations. LLMs empowered by RAG can still hallucinate facts, even when the information retrieved to support response generation is correct. These incorrect claims can be very difficult to detect, as they are often highly plausible and stated with the fluency and confidence of a subject matter expert [5]. To fully trust a response generated by an LLM, we ultimately need to understand where the information in that response came from.

Source attribution

To help LLM users understand the origins of generated responses, LLMs can be asked to attribute or cite their sources as they generate. Attributing claims back to their source comes with some major advantages: 

  1. It encourages the LLM to be faithful to passages it cites, thereby improving correctness & reducing hallucinations [11-12].
  2. It empowers users by enabling them to understand, scrutinize, and contextualize generated claims, thereby increasing trust in the application and allowing them to properly interpret the results.
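In practice, citations are often elicited simply by instructing the model. The sketch below shows one common approach; the prompt wording is hypothetical, not Pryon's actual prompt:

    # Illustrative prompt that asks the model to cite passages inline.
    CITATION_INSTRUCTIONS = (
        "You are given numbered source passages. Answer the question, and "
        "after each claim append the number(s) of the passage(s) that "
        "support it, e.g. [1] or [1][3]. Do not make any claim that the "
        "passages do not support."
    )

    def build_cited_prompt(question, passages):
        numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
        return (f"{CITATION_INSTRUCTIONS}\n\n{numbered}\n\n"
                f"Question: {question}\nAnswer:")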

The benefits of fine-grained attribution

Many LLM deployments now attribute generated responses to sources, but this is typically done at the passage level. Passage-level attribution has a major limitation: the response cannot easily be verified, even when it is correct and the citations are accurate.

Pryon’s latest generative LLMs, in contrast, support the fine-grained attribution of claims to sources. Our attribution process is deeply integrated into the generator, so that fine-grained attribution not only boosts interpretability, but also fundamentally improves both response generation and passage-level attribution. This makes Pryon’s generative LLMs suitable for even the most sensitive information-critical applications.

Let’s walk through a quick example to highlight the limitations of traditional attribution and the advantages of the fine-grained approach.

Consider the following question: 

Question: What areas did Beyonce compete in when she was growing up?

The following passages are retrieved for RAG:

  • [1] Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child....
  • [2] At age eight, Beyoncé and childhood friend Kelly Rowland met LaTavia Roberson while in an audition for an all-girl entertainment group. They were placed into a group with three other girls as Girl's Tyme, and rapped and danced on the talent show circuit in Houston. After seeing the group, R&B producer Arne Frager brought them to his Northern California studio and placed them in Star Search, the largest talent show on national TV at the time…
  • [3] Beyoncé attended St. Mary's Elementary School in Fredericksburg, Texas, where she enrolled in dance classes.... In fall of 1990, Beyoncé enrolled in Parker Elementary School, a music magnet school in Houston, where she would perform with the school's choir.  Beyoncé was also a member of the choir at St. John's United Methodist Church as a soloist for two years.

A standard RAG system produces the following result:

Beyoncé competed in singing and dancing competitions as a child, attended a music magnet school, and was a member of her school's choir and church choir [1][3].

In this case, both the answer and the passage attributions are correct, but it is not easy to verify that this is so. The user would have to read the passages to confirm the result, which is generally very tedious, as RAG deployments typically draw on many more passages than shown here.

Let’s now look at the answer generated when we do fine-grained attribution:

Beyoncé competed in singing and dancing competitions as a child [1], attended a music magnet school [3], and was a member of her school's choir and church choir [3].

References:

  • [1] Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child....
  • [2] At age eight, Beyoncé and childhood friend Kelly Rowland met LaTavia Roberson while in an audition for an all-girl entertainment group. They were placed into a group with three other girls as Girl's Tyme, and rapped and danced on the talent show circuit in Houston. After seeing the group, R&B producer Arne Frager brought them to his Northern California studio and placed them in Star Search, the largest talent show on national TV at the time…
  • [3] Beyoncé attended St. Mary's Elementary School in Fredericksburg, Texas, where she enrolled in dance classes.... In fall of 1990, Beyoncé enrolled in Parker Elementary School, a music magnet school in Houston, where she would perform with the school's choir. Beyoncé was also a member of the choir at St. John's United Methodist Church as a soloist for two years.

In contrast with standard RAG, fine-grained attributions have been extracted from the source passages, with the most relevant source text highlighted. Compared with standard attribution methods, the integrity of the generated response can be verified much more quickly. The full response gives users pointed, valuable context, supporting efficient and accurate investigation and decision-making.
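Pryon's attribution mechanism itself is proprietary, but the payoff of fine-grained output can be illustrated with a simple sketch: when each claim carries a quoted evidence span, attributions can be checked mechanically instead of by re-reading whole passages. The AttributedClaim structure and verify helper below are hypothetical illustrations:

    from dataclasses import dataclass

    @dataclass
    class AttributedClaim:
        text: str        # a single claim from the generated answer
        passage_id: int  # 1-based index of the cited passage
        evidence: str    # supporting snippet quoted from that passage

    def verify(claim, passages):
        # A fine-grained attribution can be checked mechanically: the
        # quoted evidence must actually occur in the cited passage.
        return claim.evidence in passages[claim.passage_id - 1]

    # For the example above, the first claim might carry:
    # AttributedClaim(
    #     text="competed in singing and dancing competitions as a child",
    #     passage_id=1,
    #     evidence="performed in various singing and dancing competitions as a child")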

Comparing Pryon’s attribution to GPT-3.5 & GPT-4 

The table below compares the passage-level citation quality of Pryon's in-house LLM with that of GPT-3.5 & GPT-4. Pryon's system utilizes proprietary fine-grained attribution approaches similar to those discussed above, but here we compute attribution performance at the passage level to compare Pryon with typical deployments.

Results are obtained on a Pryon-internal dataset consisting of questions from four domains, and on an external, publicly available dataset [15]. The citation precision and recall metrics, as well as the attribution rate, are adapted from past work [11-14]. Citation Recall measures whether a generated claim is entirely supported by all of its accompanying citations [11], while Citation Precision and Attribution Rate are computed based on the largest subset of citations that is determined to support each generated claim.
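Under one plausible reading of these definitions, the metrics could be computed as in the sketch below, where entails(premise, hypothesis) stands in for an NLI model or a human judgment; this illustrates the metric definitions and is not the exact evaluation code used:

    from itertools import combinations

    def largest_supporting_subset(claim, citations, entails):
        # Find the largest subset of the claim's citations whose
        # concatenation entails the claim.
        for size in range(len(citations), 0, -1):
            for subset in combinations(citations, size):
                if entails(" ".join(subset), claim):
                    return subset
        return ()

    def citation_metrics(claims, entails):
        # `claims` is a list of (claim_text, [cited_passage_texts]) pairs.
        supported = [largest_supporting_subset(t, c, entails) for t, c in claims]
        n_claims = max(len(claims), 1)
        n_citations = max(sum(len(c) for _, c in claims), 1)
        recall = sum(bool(c) and entails(" ".join(c), t)
                     for t, c in claims) / n_claims
        precision = sum(len(s) for s in supported) / n_citations
        attribution_rate = sum(bool(s) for s in supported) / n_claims
        return precision, recall, attribution_rate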


Model     Dataset                          Citation Precision    Citation Recall    Attribution Rate
GPT-3.5   Pryon internal                   77.8                  74.1               81.5
GPT-4     Pryon internal                   80.9                  77.8               85.2
Pryon     Pryon internal                   88.0                  82.4               93.5
GPT-3.5   External dataset (HAGRID [15])   91.0                  90.8               91.1
GPT-4     External dataset (HAGRID [15])   91.5                  91.4               91.5
Pryon     External dataset (HAGRID [15])   92.9                  92.7               93.3


As is evident from the table above, Pryon’s in-house LLM outperforms the GPT models we evaluated, providing a significant lift in passage-level attribution accuracy. Crucially, Pryon’s in-house LLM also returns fine-grained attributions, which make our systems suitable for even the most demanding information-critical applications.

We continue to actively investigate this important and exciting area as we strive toward generative systems that are free of hallucinations and attribution errors.

References

  1. Huang et al., A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, arXiv 2023
  2. Ji et al., Survey of Hallucination in Natural Language Generation, ACM Computing Surveys, Vol. 55, Issue 12, March 2023
  3. Liu et al., Exploring and Evaluating Hallucinations in LLM-Powered Code Generation, arXiv 2024
  4. Li et al., HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, EMNLP 2023
  5. Slater et al., ChatGPT Isn’t ‘Hallucinating’ - It’s Just Churning Out BS, LiveScience, July 2024
  6. Ma et al., Query Rewriting for Retrieval-Augmented Large Language Models, EMNLP 2023
  7. Shuster et al., Retrieval Augmentation Reduces Hallucination in Conversation, ACL 2021
  8. Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, arXiv 2024
  9. Chen et al., Benchmarking Large Language Models in Retrieval-Augmented Generation, AAAI 2024
  10. Omrani et al., Hybrid Retrieval-Augmented Generation Approach for LLMs Query Response Enhancement, ICWR 2024
  11. Gao et al., Enabling Large Language Models to Generate Text with Citations, EMNLP 2023
  12. Yue et al., Automatic Evaluation of Attribution by Large Language Models, EMNLP 2023
  13. Gao et al., RARR: Researching and Revising What Language Models Say, Using Language Models, ACL 2023
  14. Honovich et al., TRUE: Re-evaluating Factual Consistency Evaluation, ACL DialDoc Workshop 2022
  15. Kamalloo et al., HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution, arXiv 2023