How Pryon’s in-house LLM combines fine-grained attribution with RAG to deliver more trustworthy generative responses
Kasturi Bhattacharjee, Steven Rennie, and Vaibhava Goel are with Pryon Research.
Large language models (LLMs) have recently demonstrated remarkable performance on a wide range of natural language processing (NLP) tasks and beyond. This has led to their ubiquitous use in many areas, including chatbots, virtual assistants, content generators, coding assistants, language translators, and more.
While exhibiting a tremendous ability to generate content, interpret instructions, and reason, LLMs tend to hallucinate. That is, they can generate content that is inconsistent with real-world facts or inputs [1-4]. This raises questions about the reliability and trustworthiness of LLMs when used in real-world settings and has been the subject of much debate and fanfare both in academic circles and the mainstream media.
Retrieval-augmented generation (RAG) has emerged as an important way to mitigate hallucinations [6-10]. With RAG, information from relevant sources is first retrieved before an AI application generates a response. By conditioning their responses on the information sources, RAG-based applications can generate more accurate and reliable answers.
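For concreteness, here is a minimal sketch of this retrieve-then-generate loop. It is illustrative only: the toy lexical-overlap retriever and the prompt format are assumptions made for the example, not a description of any particular product's pipeline.

```python
# Minimal retrieve-then-generate (RAG) sketch. Illustrative only: the toy
# lexical-overlap retriever and prompt format are assumptions, not any
# particular system's implementation.
from typing import List


def retrieve(question: str, corpus: List[str], k: int = 3) -> List[str]:
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    return sorted(
        corpus,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )[:k]


def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Condition generation on the retrieved passages, numbered for citation."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below, citing the "
        "passage numbers that support each claim.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )


corpus = [
    "Beyoncé competed in singing and dancing competitions as a child.",
    "She attended a music magnet school and sang in her school's choir.",
    "Her first solo studio album was released in 2003.",
]
question = "What areas did Beyoncé compete in when she was growing up?"
prompt = build_rag_prompt(question, retrieve(question, corpus))
# `prompt` is now sent to the LLM; its answer is conditioned on the retrieved
# passages rather than on parametric memory alone.
```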
However, RAG alone is not enough to eliminate hallucinations. LLMs empowered by RAG can still hallucinate facts, even when the information retrieved to support response generation is correct. These incorrect claims can be very difficult to detect, as they are often highly plausible, and stated with the fluidity and confidence of a subject matter expert [5]. To be able to completely trust any response generated by an LLM, we ultimately need to understand where the information in the response was derived from.
To help LLM users understand the origins of generated responses, LLMs can be asked to attribute, or cite, their sources as they generate. Attributing claims back to their sources comes with major advantages: individual claims can be checked against the cited material, responses become more interpretable, and users can place greater trust in the system.
Many LLM deployments now attribute generated responses to sources, but typically only at the passage level. This has a major limitation: the response cannot easily be verified, even when the generation is correct and the citations are accurate.
Pryon’s latest generative LLMs, in contrast, support the fine-grained attribution of claims to sources. Our attribution process is deeply integrated into the generator, so that fine-grained attribution not only boosts interpretability, but also fundamentally improves both response generation and passage-level attribution. This makes Pryon’s generative LLMs suitable for even the most sensitive information-critical applications.
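To make the distinction between the two granularities concrete, the sketch below contrasts them as simple data structures. The class and field names are illustrative assumptions, not Pryon's actual schema or API.

```python
# Contrast of attribution granularities as simple data structures.
# Class and field names are illustrative assumptions, not an actual API schema.
from dataclasses import dataclass
from typing import List


@dataclass
class PassageCitation:
    """Passage-level attribution: a claim points at whole passages."""
    claim: str
    passage_ids: List[int]   # e.g. [1, 3]


@dataclass
class SpanCitation:
    """Fine-grained attribution: a claim points at a specific source span."""
    claim: str
    passage_id: int
    start_char: int          # offset of the supporting span
    end_char: int            # within the cited passage


# Passage-level: the reader must re-read passages 1 and 3 in full to verify.
coarse = PassageCitation(
    claim="Beyoncé competed in singing and dancing competitions as a child",
    passage_ids=[1, 3],
)

# Fine-grained: the supporting text can be highlighted directly (offsets here
# are made up for illustration).
fine = SpanCitation(
    claim="Beyoncé competed in singing and dancing competitions as a child",
    passage_id=1,
    start_char=0,
    end_char=58,
)
```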
Let’s walk through a quick example to highlight the limitations of traditional attribution and the advantages of the fine-grained approach.
Question: What areas did Beyoncé compete in when she was growing up?
Relevant passages are retrieved for RAG, and a standard RAG system generates the following answer with passage-level citations:
Beyoncé competed in singing and dancing competitions as a child, attended a music magnet school, and was a member of her school's choir and church choir [1][3].
In this case, both the answer and the passage-level attributions are correct, but verifying this is not easy: the user would have to read each cited passage in full, which quickly becomes tedious, since many more passages are typically retrieved when employing RAG.

With fine-grained attribution, the same answer instead looks like this:
Beyoncé competed in singing and dancing competitions as a child [1], attended a music magnet school [3], and was a member of her school's choir and church choir [3].
References: source passages [1] and [3], shown with the supporting text highlighted.
Here, fine-grained attributions have been extracted from the source passages, and the most relevant source text is highlighted. In contrast with standard passage-level attribution, the integrity of the generated response can be verified much faster. The full response gives users pointed, valuable context that supports efficiency and accuracy in their investigation and decision processes.
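One simple way to approximate this kind of span-level grounding after the fact is to search each cited passage for the sentence that best matches the generated claim; the sketch below does this with token overlap. This is only a rough post-hoc approximation under that assumption; Pryon's attribution is integrated into the generator itself rather than bolted on afterwards.

```python
# Post-hoc approximation of fine-grained attribution: for a generated claim
# and a cited passage, find the passage sentence with the highest token
# overlap and return its character span. Illustrative only; this is not
# Pryon's integrated, in-generator attribution mechanism.
import re
from typing import Tuple


def best_supporting_span(claim: str, passage: str) -> Tuple[int, int, str]:
    """Return (start, end, text) of the passage sentence most similar to the claim."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    best = (0, len(passage), passage)
    best_score = -1.0
    # Naive sentence split on ., !, ? terminators.
    for match in re.finditer(r"[^.!?]+[.!?]?", passage):
        sentence = match.group().strip()
        if not sentence:
            continue
        sent_tokens = set(re.findall(r"\w+", sentence.lower()))
        score = len(claim_tokens & sent_tokens) / max(len(sent_tokens), 1)
        if score > best_score:
            best_score = score
            best = (match.start(), match.end(), sentence)
    return best


passage = (
    "Born and raised in Houston, Texas, she performed in various singing "
    "and dancing competitions as a child. She rose to fame in the late "
    "1990s as lead singer of Destiny's Child."
)
claim = "Beyoncé competed in singing and dancing competitions as a child"
start, end, span = best_supporting_span(claim, passage)
# `span` is the sentence that would be highlighted as the fine-grained citation.
```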
The table below compares the passage-level citation quality of Pryon’s in-house LLM with that of GPT-3.5 and GPT-4. Pryon’s system uses the proprietary fine-grained attribution approach described above, but here we compute attribution performance at the passage level so that Pryon can be compared with typical deployments.
Results are reported on a Pryon-internal dataset consisting of questions from four domains and on an external, publicly available dataset [15]. The citation precision, citation recall, and attribution rate metrics are adapted from past work [11-14]. Citation recall measures whether each generated claim is entirely supported by all of its accompanying citations taken together [11], while citation precision and attribution rate are computed from the largest subset of citations determined to support each generated claim.
| Model | Dataset | Citation Precision | Citation Recall | Attribution Rate |
|---|---|---|---|---|
| GPT-3.5 | Pryon internal | 77.8 | 74.1 | 81.5 |
| GPT-4 | Pryon internal | 80.9 | 77.8 | 85.2 |
| Pryon | Pryon internal | 88.0 | 82.4 | 93.5 |
| GPT-3.5 | External (HAGRID [15]) | 91.0 | 90.8 | 91.1 |
| GPT-4 | External (HAGRID [15]) | 91.5 | 91.4 | 91.5 |
| Pryon | External (HAGRID [15]) | 92.9 | 92.7 | 93.3 |
As evident from the table above, Pryon’s in-house LLM outperforms the GPT models we evaluated, providing a significant lift in passage-level attribution accuracy. Crucially, Pryon’s in-house LLM also returns fine-grained attributions, which make our systems suitable for even the most demanding information-critical applications.
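For readers who want to run this kind of evaluation themselves, the sketch below shows one way passage-level citation metrics can be computed from binary support judgments (for example, from an NLI model or a human annotator). The formulations here are simplified assumptions in the spirit of [11-14], not the exact definitions used for the table above.

```python
# Simplified passage-level citation metrics from binary support judgments.
# The `supports(claim, passages)` judgment is assumed to come from an NLI
# model or human annotation; the formulations below are simplified
# illustrations, not the exact definitions used in the table above.
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def evaluate_citations(
    claims: Sequence[Dict],                      # {"text": str, "citations": List[str]}
    passages: Dict[str, str],                    # citation id -> passage text
    supports: Callable[[str, List[str]], bool],  # do these passages entail the claim?
) -> Dict[str, float]:
    recalled = 0        # claims fully supported by ALL of their citations together
    precise_cites = 0   # citations belonging to the largest supporting subset
    total_cites = 0
    attributed = 0      # claims supported by at least one subset of their citations

    for claim in claims:
        cited_texts = [passages[c] for c in claim["citations"]]
        total_cites += len(cited_texts)

        if cited_texts and supports(claim["text"], cited_texts):
            recalled += 1

        # Find the largest subset of the citations that supports the claim.
        best_subset: List[str] = []
        for r in range(len(cited_texts), 0, -1):
            for subset in combinations(cited_texts, r):
                if supports(claim["text"], list(subset)):
                    best_subset = list(subset)
                    break
            if best_subset:
                break
        if best_subset:
            attributed += 1
            precise_cites += len(best_subset)

    n = max(len(claims), 1)
    return {
        "citation_recall": recalled / n,
        "citation_precision": precise_cites / max(total_cites, 1),
        "attribution_rate": attributed / n,
    }
```

In practice, the `supports` judgment is usually an entailment check from an NLI model or a human annotator, and that is where most of the evaluation's nuance lies.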
Active investigation continues in this important and exciting area as we work toward generative systems that are free of hallucinations and attribution errors.
References