How Pryon’s in-house LLM combines fine-grained attribution with RAG to deliver more trustworthy generative responses
Kasturi Bhattacharjee, Steven Rennie, and Vaibhava Goel are with Pryon Research.
Large language models (LLMs) have recently demonstrated remarkable performance on a wide range of natural language processing (NLP) tasks and beyond. This has led to their ubiquitous use in many areas, including chatbots, virtual assistants, content generators, coding assistants, language translators, and more.
While exhibiting a tremendous ability to generate content, interpret instructions, and reason, LLMs tend to hallucinate. That is, they can generate content that is inconsistent with real-world facts or inputs [1-4]. This raises questions about the reliability and trustworthiness of LLMs when used in real-world settings and has been the subject of much debate and fanfare both in academic circles and the mainstream media.
Retrieval-augmented generation (RAG) has emerged as an important way to mitigate hallucinations [6-10]. With RAG, information from relevant sources is first retrieved before an AI application generates a response. By conditioning their responses on the information sources, RAG-based applications can generate more accurate and reliable answers.
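For concreteness, here is a minimal sketch of this retrieve-then-generate loop. It is illustrative only: the toy lexical-overlap retriever and the prompt format are assumptions made for the example, not a description of any particular product's pipeline.

```python
# Minimal retrieve-then-generate (RAG) sketch. Illustrative only: the toy
# lexical-overlap retriever and prompt format are assumptions, not any
# particular system's implementation.
from typing import List


def retrieve(question: str, corpus: List[str], k: int = 3) -> List[str]:
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    return sorted(
        corpus,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )[:k]


def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Condition generation on the retrieved passages, numbered for citation."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below, citing the "
        "passage numbers that support each claim.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )


corpus = [
    "Beyoncé competed in singing and dancing competitions as a child.",
    "She attended a music magnet school and sang in her school's choir.",
    "Her first solo studio album was released in 2003.",
]
question = "What areas did Beyoncé compete in when she was growing up?"
prompt = build_rag_prompt(question, retrieve(question, corpus))
# `prompt` is now sent to the LLM; its answer is conditioned on the retrieved
# passages rather than on parametric memory alone.
```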
However, RAG alone is not enough to eliminate hallucinations. LLMs empowered by RAG can still hallucinate facts, even when the information retrieved to support response generation is correct. These incorrect claims can be very difficult to detect, as they are often highly plausible, and stated with the fluidity and confidence of a subject matter expert [5]. To be able to completely trust any response generated by an LLM, we ultimately need to understand where the information in the response was derived from.
To help LLM users understand the origins of generated responses, LLMs can be asked to attribute, or cite, their sources as they generate. Attributing claims back to their sources comes with major advantages: individual claims can be checked against the cited material, responses become more interpretable, and users can place greater trust in the system.
Many LLM deployments now attribute generated responses to sources, but typically only at the passage level. This has a major limitation: the response cannot easily be verified, even when the generation is correct and the citations are accurate.
Pryon’s latest generative LLMs, in contrast, support the fine-grained attribution of claims to sources. Our attribution process is deeply integrated into the generator, so that fine-grained attribution not only boosts interpretability, but also fundamentally improves both response generation and passage-level attribution. This makes Pryon’s generative LLMs suitable for even the most sensitive information-critical applications.
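To make the distinction between the two granularities concrete, the sketch below contrasts them as simple data structures. The class and field names are illustrative assumptions, not Pryon's actual schema or API.

```python
# Contrast of attribution granularities as simple data structures.
# Class and field names are illustrative assumptions, not an actual API schema.
from dataclasses import dataclass
from typing import List


@dataclass
class PassageCitation:
    """Passage-level attribution: a claim points at whole passages."""
    claim: str
    passage_ids: List[int]   # e.g. [1, 3]


@dataclass
class SpanCitation:
    """Fine-grained attribution: a claim points at a specific source span."""
    claim: str
    passage_id: int
    start_char: int          # offset of the supporting span
    end_char: int            # within the cited passage


# Passage-level: the reader must re-read passages 1 and 3 in full to verify.
coarse = PassageCitation(
    claim="Beyoncé competed in singing and dancing competitions as a child",
    passage_ids=[1, 3],
)

# Fine-grained: the supporting text can be highlighted directly (offsets here
# are made up for illustration).
fine = SpanCitation(
    claim="Beyoncé competed in singing and dancing competitions as a child",
    passage_id=1,
    start_char=0,
    end_char=58,
)
```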
Let’s walk through a quick example to highlight the limitations of traditional attribution and the advantages of the fine-grained approach.
Question: What areas did Beyoncé compete in when she was growing up?
Relevant passages are retrieved for RAG, and a standard RAG system generates the following answer with passage-level citations:
Beyoncé competed in singing and dancing competitions as a child, attended a music magnet school, and was a member of her school's choir and church choir [1][3].
In this case, both the answer and the passage-level attributions are correct, but verifying this is not easy: the user would have to read each cited passage in full, which quickly becomes tedious, since many more passages are typically retrieved when employing RAG.

With fine-grained attribution, the same answer instead looks like this:
Beyoncé competed in singing and dancing competitions as a child [1], attended a music magnet school [3], and was a member of her school's choir and church choir [3].
References: source passages [1] and [3], shown with the supporting text highlighted.
Here, fine-grained attributions have been extracted from the source passages, and the most relevant source text is highlighted. In contrast with standard passage-level attribution, the integrity of the generated response can be verified much faster. The full response gives users pointed, valuable context that supports efficiency and accuracy in their investigation and decision processes.
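One simple way to approximate this kind of span-level grounding after the fact is to search each cited passage for the sentence that best matches the generated claim; the sketch below does this with token overlap. This is only a rough post-hoc approximation under that assumption; Pryon's attribution is integrated into the generator itself rather than bolted on afterwards.

```python
# Post-hoc approximation of fine-grained attribution: for a generated claim
# and a cited passage, find the passage sentence with the highest token
# overlap and return its character span. Illustrative only; this is not
# Pryon's integrated, in-generator attribution mechanism.
import re
from typing import Tuple


def best_supporting_span(claim: str, passage: str) -> Tuple[int, int, str]:
    """Return (start, end, text) of the passage sentence most similar to the claim."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    best = (0, len(passage), passage)
    best_score = -1.0
    # Naive sentence split on ., !, ? terminators.
    for match in re.finditer(r"[^.!?]+[.!?]?", passage):
        sentence = match.group().strip()
        if not sentence:
            continue
        sent_tokens = set(re.findall(r"\w+", sentence.lower()))
        score = len(claim_tokens & sent_tokens) / max(len(sent_tokens), 1)
        if score > best_score:
            best_score = score
            best = (match.start(), match.end(), sentence)
    return best


passage = (
    "Born and raised in Houston, Texas, she performed in various singing "
    "and dancing competitions as a child. She rose to fame in the late "
    "1990s as lead singer of Destiny's Child."
)
claim = "Beyoncé competed in singing and dancing competitions as a child"
start, end, span = best_supporting_span(claim, passage)
# `span` is the sentence that would be highlighted as the fine-grained citation.
```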
The table below compares the passage-level citation quality of Pryon’s in-house LLM with that of GPT-3.5 and GPT-4. Pryon’s system uses the proprietary fine-grained attribution approach described above, but here we compute attribution performance at the passage level so that Pryon can be compared with typical deployments.
Results are reported on a Pryon-internal dataset consisting of questions from four domains and on an external, publicly available dataset [15]. The citation precision, citation recall, and attribution rate metrics are adapted from past work [11-14]. Citation recall measures whether each generated claim is entirely supported by all of its accompanying citations taken together [11], while citation precision and attribution rate are computed from the largest subset of citations determined to support each generated claim.
| Model | Dataset | Citation Precision | Citation Recall | Attribution Rate |
|---|---|---|---|---|
| GPT-3.5 | Pryon internal | 77.8 | 74.1 | 81.5 |
| GPT-4 | Pryon internal | 80.9 | 77.8 | 85.2 |
| Pryon | Pryon internal | 88.0 | 82.4 | 93.5 |
| GPT-3.5 | External (HAGRID [15]) | 91.0 | 90.8 | 91.1 |
| GPT-4 | External (HAGRID [15]) | 91.5 | 91.4 | 91.5 |
| Pryon | External (HAGRID [15]) | 92.9 | 92.7 | 93.3 |
As evident from the table above, Pryon’s in-house LLM outperforms the GPT models we evaluated, providing a significant lift in passage-level attribution accuracy. Crucially, Pryon’s in-house LLM also returns fine-grained attributions, which make our systems suitable for even the most demanding information-critical applications.
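For readers who want to run this kind of evaluation themselves, the sketch below shows one way passage-level citation metrics can be computed from binary support judgments (for example, from an NLI model or a human annotator). The formulations here are simplified assumptions in the spirit of [11-14], not the exact definitions used for the table above.

```python
# Simplified passage-level citation metrics from binary support judgments.
# The `supports(claim, passages)` judgment is assumed to come from an NLI
# model or human annotation; the formulations below are simplified
# illustrations, not the exact definitions used in the table above.
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def evaluate_citations(
    claims: Sequence[Dict],                      # {"text": str, "citations": List[str]}
    passages: Dict[str, str],                    # citation id -> passage text
    supports: Callable[[str, List[str]], bool],  # do these passages entail the claim?
) -> Dict[str, float]:
    recalled = 0        # claims fully supported by ALL of their citations together
    precise_cites = 0   # citations belonging to the largest supporting subset
    total_cites = 0
    attributed = 0      # claims supported by at least one subset of their citations

    for claim in claims:
        cited_texts = [passages[c] for c in claim["citations"]]
        total_cites += len(cited_texts)

        if cited_texts and supports(claim["text"], cited_texts):
            recalled += 1

        # Find the largest subset of the citations that supports the claim.
        best_subset: List[str] = []
        for r in range(len(cited_texts), 0, -1):
            for subset in combinations(cited_texts, r):
                if supports(claim["text"], list(subset)):
                    best_subset = list(subset)
                    break
            if best_subset:
                break
        if best_subset:
            attributed += 1
            precise_cites += len(best_subset)

    n = max(len(claims), 1)
    return {
        "citation_recall": recalled / n,
        "citation_precision": precise_cites / max(total_cites, 1),
        "attribution_rate": attributed / n,
    }
```

In practice, the `supports` judgment is usually an entailment check from an NLI model or a human annotator, and that is where most of the evaluation's nuance lies.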
Active investigation continues in this important and exciting area as we work toward generative systems that are free of hallucinations and attribution errors.
References