LLM: RAG

Large Language Model
Overview
In this blog, I am going to summarize Retrieval-Augmented Generation (RAG): why LLMs need it, how a basic pipeline works, and the three main paradigms, Naive RAG, Advanced RAG, and Modular RAG, along with the techniques each uses for retrieval, augmentation, and generation.
Published

2025-08-07

Last modified

2025-08-07

Retrieval-Augmented Generation (RAG) is a technique that equips an LLM with the ability to use up-to-date external knowledge. There are several variants, but they typically rely on a vector database to store the documents the LLM draws on.

Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases.

This is good for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG retrieves relevant document chunks from an external knowledge base through semantic similarity calculation, and in doing so effectively reduces the problem of generating factually incorrect content.

In-Context Learning (ICL) abilities first appeared with GPT-3, enabling RAG to handle more complex, knowledge-intensive tasks at inference time. Previously, the knowledge retrieved by RAG was instead injected by fine-tuning the model. Later, RAG enhancements began to be combined more closely with LLM fine-tuning techniques again.

As the name indicates, several techniques are used for “Retrieval”, “Generation” and “Augmentation”.

There are three main paradigms of RAG:

1 Naive RAG

Naive RAG follows a traditional process that includes indexing, retrieval, and generation, which is also characterized as a “Retrieve-Read” framework.

  • Indexing: starts with the cleaning and extraction of raw data in diverse formats like PDF, HTML, Word, and Markdown, which is then converted into a uniform plain-text format. Due to the context limitation of language models, the text needs to be segmented into smaller, digestible chunks. Each chunk is then encoded into a vector representation using an embedding model and stored in a vector database.
  • Retrieval: upon receiving a query from the user, the RAG system employs the same encoding model used during the indexing phase to transform the query into a vector representation. It then computes similarity scores between the query vector and the vectors of the chunks within the indexed corpus, and returns the Top-K chunks that show the greatest similarity to the query. These chunks are then used as the expanded context in the prompt.
  • Generation: the query and the retrieved chunks are synthesized into a prompt and sent to the LLM to generate the response (a minimal end-to-end sketch follows this list).
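The following is a minimal sketch of this Retrieve-Read loop. It assumes the sentence-transformers library for embeddings; the model name, chunk size, and prompt template are illustrative choices rather than part of the framework itself.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# --- Indexing: split cleaned text into chunks and embed them ---
def chunk_text(text, chunk_size=200):
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

documents = ["...plain text extracted from PDF/HTML/Word/Markdown..."]
chunks = [c for doc in documents for c in chunk_text(doc)]
chunk_vectors = encoder.encode(chunks, normalize_embeddings=True)  # a vector DB would store these

# --- Retrieval: embed the query with the same encoder and return the Top-K chunks ---
def retrieve(query, k=3):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q                 # cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]

# --- Generation: synthesize the query and chunks into a prompt for the LLM ---
def build_prompt(query, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)
    return f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What is RAG?", retrieve("What is RAG?"))
# `prompt` is then sent to whichever LLM you are using.
```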

As we can see, there are several limitations of Naive RAG:

  • Retrieval: the retrieval phase struggles with precision and recall, leading to the selection of misaligned or irrelevant chunks and to missing crucial information.
  • Generation: when the retrieved chunks are irrelevant to the query, the LLM may produce content that is not supported by the context.
  • Augmentation hurdles: integrating retrieved information with the generation task can be challenging, sometimes resulting in disjointed or incoherent outputs. The process may also encounter redundancy when similar information is retrieved from multiple sources, leading to repetitive responses.

2 Advanced RAG

Advanced RAG introduces specific improvements to overcome the limitations of Naive RAG.

  • Retrieval: employs pre-retrieval and post-retrieval strategies, and also incorporates optimization methods.
  • Indexing: refines indexing techniques through a sliding-window approach, fine-grained segmentation, and the incorporation of metadata (a short chunking sketch follows this list).
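A minimal sketch of sliding-window segmentation with overlap, one of the indexing refinements mentioned above; the window and stride sizes, and the metadata fields, are illustrative.

```python
def sliding_window_chunks(text, window=200, stride=100):
    """Split text into overlapping word windows so that sentences cut at one
    chunk boundary still appear intact in the neighbouring chunk."""
    words = text.split()
    chunks = []
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        chunks.append(" ".join(words[start:start + window]))
    return chunks

# Each chunk can also carry metadata (source file, page, section title) to
# support metadata-based filtering at retrieval time.
indexed = [{"text": c, "source": "report.pdf"} for c in sliding_window_chunks("...")]
```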

2.1 Pre-Retrieval Process

The primary focus is on optimizing the indexing structure and the original query.

  • Optimizing Indexing: enhance the quality of the content being indexed:
    • Enhancing data granularity
    • Optimizing index structures
    • Adding metadata
    • Alignment optimization
    • Mixed retrieval
  • Optimizing Query: make the user’s original question clearer and more suitable for the retrieval task (a short sketch follows this list), through:
    • Query rewriting
    • Query Transformation
    • Query Expansion
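Below is a hedged sketch of query expansion: the original question is paraphrased into several variants and retrieval is run once per variant before merging the results. The `llm_complete` helper is a hypothetical stand-in for whatever LLM call you use.

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call (API or local model)."""
    raise NotImplementedError

def expand_query(query: str, n_variants: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following question in {n_variants} different ways, "
        f"one per line, keeping the original meaning:\n\n{query}"
    )
    variants = [line.strip() for line in llm_complete(prompt).splitlines() if line.strip()]
    return [query] + variants[:n_variants]

# Retrieval is then run once per variant, and the retrieved chunk sets are
# merged and de-duplicated before the generation step.
```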

2.2 Post-Retrieval Process

Once relevant context is retrieved, it is crucial to integrate it effectively with the query. The main methods in this stage are:

  • Re-ranking chunks: re-rank the retrieved information to relocate the most relevant content to the edges of the prompt (a short sketch follows this list).
  • Context compression: concentrate on selecting the essential information, emphasizing critical sections, and shortening the context to be processed.
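A minimal sketch of the re-ranking step, which orders chunks so the highest-scoring ones sit at the beginning and end of the context (LLMs tend to pay less attention to material buried in the middle of long prompts). The relevance scores are assumed to come from the retrieval stage.

```python
def rerank_to_edges(chunks_with_scores):
    """Place the highest-scoring chunks at the start and end of the prompt.

    chunks_with_scores: list of (chunk_text, relevance_score) pairs.
    """
    ranked = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
    front, back = [], []
    for i, (chunk, _) in enumerate(ranked):
        # Alternate front/back so the top two chunks end up at the edges.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = rerank_to_edges([("chunk A", 0.91), ("chunk B", 0.75), ("chunk C", 0.42)])
# -> ["chunk A", "chunk C", "chunk B"]: the two most relevant chunks sit at the edges.
```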

3 Modular RAG

Modular RAG advances the two paradigms above by incorporating diverse strategies for improving its components, such as:

  • Adding a search module for similarity searches
  • Refining the retriever through fine-tuning

There are several examples:

  • Restructured RAG modules
  • Re-arranged RAG pipelines

Modular RAG supports both sequential processing and integrated end-to-end training across its components.

3.1 New Modules

3.1.1 Search Module

The search module adapts to specific scenarios, enabling direct searches across various sources like search engines, databases, and knowledge graphs, using LLM-generated code and query languages.

RAG-Fusion addresses traditional search limitations by employing a multi-query strategy that expands the user query into diverse perspectives, utilizing parallel vector searches and intelligent re-ranking to uncover both explicit and transformative knowledge.
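A hedged sketch of the fusion step: each rewritten query produces its own ranked list of chunks, and the lists are merged with reciprocal rank fusion (a common choice for this kind of re-ranking; the constant k=60 is the usual default, not something prescribed here).

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk ids into one fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Each inner list is the Top-K result for one rewritten version of the user query.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc3", "doc2"],
    ["doc1", "doc9", "doc3"],
])
# -> "doc1" and "doc3" rise to the top because they rank highly in several lists.
```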

3.1.2 Memory Module

The memory module leverages the LLM’s memory to guide retrieval, creating an unbounded memory pool that aligns the text more closely with the data distribution through iterative self-enhancement.

3.1.3 Routing Module

The routing module navigates through diverse data sources, selecting the optimal pathway for a query, whether that involves summarization, database searches, or merging different information streams.
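A minimal sketch of such routing, assuming a hypothetical `classify_route` LLM call that labels the query; the route names and handlers are illustrative only.

```python
def classify_route(query: str) -> str:
    """Hypothetical LLM call that labels the query with one of the route names below."""
    raise NotImplementedError

def summarize_route(query: str) -> str:
    return f"[summarize long sources relevant to: {query}]"

def database_route(query: str) -> str:
    return f"[translate to a SQL / knowledge-graph query: {query}]"

def vector_route(query: str) -> str:
    return f"[semantic Top-K retrieval for: {query}]"

ROUTES = {
    "summarization": summarize_route,
    "database": database_route,
    "vector_search": vector_route,
}

def route_query(query: str) -> str:
    route = classify_route(query)
    return ROUTES.get(route, vector_route)(query)  # fall back to semantic retrieval
```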

3.1.4 Predict Module

The predict module aims to reduce redundancy and noise by generating context directly through the LLM, ensuring relevance and accuracy.

3.1.5 Task Adapter Module

The adapter module tailors RAG to various downstream tasks, automating prompt retrieval for zero-shot inputs and creating task-specific retrievers through few-shot query generation.

3.2 New Pattern

Modular RAG offers remarkable adaptability by allowing module substitution or reconfiguration to address specific challenges. It further expands flexibility by integrating new modules or adjusting the interaction flow among existing ones, enhancing its applicability across different tasks.