RAG Definition and LLM Glossary

Many new AI-focused companies, business models, technologies, and solutions for different industries have surfaced in the last few years. But AI isn't new to us at Pryon. Our founders spent decades pioneering a number of foundational AI technologies, including notable advancements in natural language processing, multimodal AI, and computer vision, before establishing Pryon in 2017.

We're eager to share what we know about this dynamic space. Use the table of contents pane to navigate to a particular section, or search directly for the term you’re looking for using CMD+F or CTRL+F.

Large language models (LLMs): LLMs are what enable an AI application to provide outputs (responses) to users. LLMs are typically trained on a large set of data so they can learn what to "say," similar to how humans are educated through books and other media so we know how to communicate. Many companies have introduced their own LLMs, such as OpenAI (GPT), Google (Gemini), Meta (Llama), and Mistral (Mixtral). At Pryon, we have our own set of LLMs, expertly designed to ingest enterprise content and deliver trustworthy answers.


Machine learning (ML):
Think of ML as a subset of AI. One of the reasons artificial intelligence can do certain tasks so well (and so much faster than a human could), from predicting equipment failure to identifying the sales opportunity most likely to close, is that it's capable of progressively learning at a rapid rate. ML is what makes it possible for AI to understand data, learn from patterns in the data, make predictions or classify items into groups, and incrementally improve over time. As machine learning technology improves, so does artificial intelligence.
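To make the learn-from-patterns idea concrete, here is a minimal sketch in Python using scikit-learn. The equipment-failure data and feature names are invented purely for illustration.

```python
# A minimal sketch of supervised machine learning: the model learns
# patterns from labeled examples, then classifies new items.
# The readings below are invented purely for illustration.
from sklearn.linear_model import LogisticRegression

# Each row: [days_since_last_maintenance, vibration_level]
readings = [[10, 0.2], [200, 0.9], [15, 0.3], [180, 0.8]]
failed   = [0, 1, 0, 1]  # 1 = equipment failed, 0 = equipment fine

model = LogisticRegression()
model.fit(readings, failed)          # learn patterns from the data

print(model.predict([[190, 0.85]]))  # predict failure for new readings
```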

General AI/RAG/LLM Terms

Bounding box and coordinates: To provide attribution to a specific sentence or paragraph within a document, a RAG system needs to be able to assign some kind of value to each content chunk. This is done through bounding box/coordinate values. The RAG system’s computer vision model analyzes the layout of a document and assigns each chunk a coordinate value. These coordinates are stored as metadata within a vector database. When the LLM delivers a response, it retrieves the coordinates of the source chunk, so a link to the source chunk (attribution) can be added to the response.
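As a rough illustration, a content chunk stored in a vector database might look like the sketch below. The field names are hypothetical, not a specification of Pryon's or any other system.

```python
# A hypothetical chunk record as it might be stored in a vector database.
# The bounding box (page number plus x/y coordinates) lets the application
# link a generated answer back to the exact source passage.
chunk_record = {
    "chunk_id": "doc-42-chunk-7",
    "text": "Employees: 15,000",
    "embedding": [0.12, -0.48, 0.33],   # truncated for readability
    "metadata": {
        "source_document": "company_overview.pdf",
        "page": 3,
        "bounding_box": {"x0": 72.0, "y0": 540.5, "x1": 310.2, "y1": 556.0},
    },
}

# When the LLM cites this chunk, the stored coordinates can be used to
# highlight the passage in the original document (attribution).
bbox = chunk_record["metadata"]["bounding_box"]
print(f"Source: page {chunk_record['metadata']['page']}, box {bbox}")
```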

This concept is also referenced in the realm of object detection, a technique used in computer vision. Object detection models receive an image as input and output coordinates of the bounding boxes and associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and a label (e.g. it can have a car and a building), and each object can be present in different parts of an image (e.g. the image can have several cars). This task is commonly used in autonomous driving for detecting things like pedestrians, road signs, and traffic lights. Other applications include counting objects in images, image search, and more.


GPUs:
GPUs, or graphics processing units, power LLMs and SLMs (small language models). Historically, computers were powered solely by CPUs, but in the last couple of decades, computer engineers realized GPUs were more efficient at certain computing tasks — not just producing graphics, as the name implies, but also tasks related to AI. As their usage has skyrocketed, GPUs — made by companies like NVIDIA — have consumed more and more energy and required more intensive cooling, which can have major negative effects on the environment. Pryon’s proprietary set of LLMs and SLMs uses fewer GPU resources than many competing language models, making Pryon RAG Suite an energy-efficient way to deploy generative AI at enterprise scale.


Neural networks:
Neural networks are used in machine learning as a way to mimic how neurons in the human brain communicate. Just as neurons "fire" together to enable us to do things like walk, talk, and understand a book, neural networks are a series of nodes that work together to accomplish a certain task, such as summarizing text, performing natural language processing (NLP), or recognizing images.
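Here's a minimal sketch, in plain Python, of the "nodes working together" idea: each node weighs its inputs, sums them, and "fires" through an activation function. The weights and biases are invented for illustration; real networks learn them during training.

```python
import math

def node(inputs, weights, bias):
    """One 'neuron': weigh the inputs, sum them, and squash the result
    through a sigmoid activation so the output falls between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

# A tiny layer of two nodes processing the same three inputs.
# Weights and biases here are invented; real networks learn them.
inputs = [0.5, -1.2, 0.3]
layer = [
    node(inputs, [0.4, 0.1, -0.6], bias=0.0),
    node(inputs, [-0.3, 0.8, 0.2], bias=0.1),
]
print(layer)  # these outputs would feed the next layer of nodes
```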

The reason Pryon Ingestion Engine can understand enterprise content like a human would is its advanced deep learning techniques, such as visual segmentation and OCR (optical character recognition). These deep learning techniques are powered by neural networks.


Orchestration:
The process of coordinating and managing the interactions between different AI components so they work together effectively. This process involves a central platform that manages the deployment, integration, and interaction of AI components such as databases, algorithms, AI models, and neural networks.
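A minimal sketch of that coordination is below, with hypothetical retriever, reranker, and generator functions standing in for real components; an actual orchestration platform layers deployment, monitoring, and error handling on top of this basic flow.

```python
# A minimal sketch of AI orchestration: one central function coordinates
# several components. The component functions here are hypothetical
# stand-ins, not any particular product's API.
def retrieve(query):            # vector database lookup
    return ["chunk about employees", "chunk about revenue"]

def rerank(query, chunks):      # response selection model
    return chunks[:1]           # keep the most relevant chunk

def generate(query, context):   # generative LLM
    return f"Answer to {query!r} based on {context}"

def answer(query):
    """The orchestration layer: sequence the components and pass data
    between them so they work together as one application."""
    chunks = retrieve(query)
    best = rerank(query, chunks)
    return generate(query, best)

print(answer("How many employees do we have?"))
```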


REST APIs:
APIs make applications extensible, providing the ability for one application to integrate with another. They define what the consumer must send (the call, or request) and what the producer/provider returns (the response). For example, Weather.com’s API might call for a ZIP code and respond with the high and low temperature for that ZIP code.

REST APIs (aka RESTful APIs) are a kind of API that conforms to a specific architectural standard. Without going into a lot of technical detail, REST APIs are faster and more lightweight than many other kinds of APIs, and they offer increased scalability.
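For illustration, here's a minimal sketch of calling a REST API in Python with the requests library. The endpoint URL, parameters, and response fields are hypothetical, not Weather.com's actual API.

```python
import requests

# Hypothetical endpoint and parameters, for illustration only.
response = requests.get(
    "https://api.example.com/v1/forecast",
    params={"zip": "10001"},       # the call: what the consumer sends
    timeout=10,
)
response.raise_for_status()

data = response.json()             # the response: what the provider returns
print(data["high"], data["low"])   # assumed field names in the JSON payload
```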


Small language models (SLMs): Like large language models (LLMs), SLMs are what provide the responses that generative AI application users receive. The difference is that unlike LLMs, which are built for general-purpose use, SLMs are built for specific purposes. For instance, an LLM might be capable of everything from writing a poem to penning an academic dissertation, but an SLM might be trained specifically to answer questions posed by users (instead of generating original content). Because SLMs aren’t built for as many use cases as LLMs, they’re smaller in size, which not only benefits performance but also makes them far more energy-efficient.


Tokens:
In natural language processing (NLP), a token is a fundamental building block of text. Tokens are the smallest units into which a piece of text, such as a sentence or document, can be divided. The process of converting text to tokens is called tokenization.

Tokens are typically words, where each word in a sentence is considered a separate token. For example, in the sentence "I love NLP," there are three words and therefore three tokens: "I," "love," and "NLP." However, tokenization can also involve subword, character, or even sentence-level units, depending on the specific task and the tokenization approach used.
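As a minimal sketch, word-level tokenization can be as simple as splitting on whitespace; production LLMs typically use learned subword tokenizers instead.

```python
sentence = "I love NLP"

# Word-level tokenization: split on whitespace.
word_tokens = sentence.split()
print(word_tokens)        # ['I', 'love', 'NLP']

# Character-level tokenization: every character is a token.
char_tokens = list(sentence)
print(char_tokens)        # ['I', ' ', 'l', 'o', 'v', 'e', ...]
```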

Tokens are the basic units of data processed by LLMs, so the more tokens an LLM can process at once, the more powerful the LLM. (It’s like multicellular organisms; the more cells an organism has, the more complex that organism is.)


Transformers:
Transformers are a kind of deep learning architecture designed to process natural language and other data. The transformer architecture was first described in a 2017 Google research paper ("Attention Is All You Need") and gave rise to the modern LLM-based AI applications many of us are familiar with, such as ChatGPT. (Unrelated to the toys or Michael Bay films.)
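The core operation inside a transformer is attention, which lets every token weigh every other token when building its representation. Below is a minimal NumPy sketch of scaled dot-product attention; the matrices are random placeholders for illustration.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the core transformer operation:
    each token's output is a weighted mix of all tokens' values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # token-to-token relevance
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ V

# Three tokens with four-dimensional representations; random placeholders.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)  # (3, 4): one output vector per token
```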

Response attribution: Augments responses with original source attribution, so users can reference the source document to verify the answer provided and/or gather additional context.


Response selection model:
Determines which of the retrieved responses are the most relevant to feed to the generative model. Companies that provide developers with the building blocks for RAG, such as Pryon, have invested time and energy in optimizing their response selection models to provide accurate and helpful responses.
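A minimal sketch of the idea: score each retrieved candidate against the query (here with cosine similarity over invented placeholder embeddings) and keep the best. Production response selection models are far more sophisticated.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Invented placeholder embeddings for a query and three retrieved chunks.
query_vec = [0.9, 0.1, 0.0]
candidates = {
    "chunk A": [0.8, 0.2, 0.1],
    "chunk B": [0.1, 0.9, 0.3],
    "chunk C": [0.7, 0.0, 0.2],
}

# Response selection: rank candidates by relevance and keep the best one
# to feed to the generative model.
best = max(candidates, key=lambda c: cosine(query_vec, candidates[c]))
print(best)
```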


Response summarization:
When a RAG application summarizes the retrieved responses using a generative LLM instead of simply providing answer snippets for the user to peruse. For instance, when asked, “How many employees do we have?” a RAG application could directly tell the user, “ACME company has 15,000 employees,” instead of simply pointing the user to a company overview brochure that says “Employees: 15,000.”
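As a rough sketch, response summarization amounts to handing the retrieved snippets to a generative model with instructions to answer directly. The call_llm function below is a hypothetical stand-in for whatever model the application actually uses.

```python
# A rough sketch of response summarization. call_llm is a hypothetical
# stand-in for an actual generative model call.
def call_llm(prompt):
    return "ACME company has 15,000 employees."

def summarize_answer(question, snippets):
    """Assemble the retrieved snippets into a prompt asking the model
    to answer the question directly, rather than returning raw snippets."""
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question directly, using only the snippets below.\n"
        f"Question: {question}\n"
        f"Snippets:\n{context}"
    )
    return call_llm(prompt)

print(summarize_answer(
    "How many employees do we have?",
    ["Employees: 15,000", "Founded: 1998"],
))
```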