Agentic AI: Data Ingestion and Retrieval Techniques

A Comprehensive Lesson from First Principles


Table of Contents

  1. What is Agentic AI?
  2. Why Ingestion and Retrieval Matter
  3. Part 1: Data Ingestion Techniques
  4. Part 2: Data Retrieval Techniques
  5. Part 3: Advanced Techniques
  6. Part 4: Vector Databases and Storage Architecture
  7. Part 5: End-to-End Architecture Example
  8. Part 6: Evaluation and Quality Metrics
  9. Glossary

Introduction

1. What is Agentic AI?

Before diving into ingestion and retrieval, it helps to understand what "Agentic AI" actually means.

A traditional AI chatbot takes a question and generates a single response from its pre-trained knowledge. That is it. It has no memory of your documents, no ability to look things up, and no ability to take actions.

Agentic AI is different. It is a system where an AI (typically a Large Language Model, or LLM) can:

  • Use tools (search the web, query a database, run code, send emails)
  • Retrieve knowledge from your own private documents and data
  • Take multi-step actions, deciding what to do next at each step
  • Remember context across a conversation or workflow

Think of an agent as a capable employee who, rather than only drawing on what they memorised in university, can also reach into a filing cabinet, search a database, run a calculation, and call a colleague, all to answer your question.

Example to Ground This

Imagine a bank has 10,000 internal policy documents. A customer asks: "What is the process for disputing a foreign transaction?"

A traditional LLM would guess or hallucinate an answer. An agentic AI would:

  1. Search the internal document store for relevant policy documents
  2. Pull the most relevant chunks of text
  3. Reason over those chunks
  4. Produce an accurate answer grounded in real documents

The magic that makes this possible is ingestion (loading and indexing documents) and retrieval (finding the right document chunks at query time).


2. Why Ingestion and Retrieval Matter

LLMs have a fixed "knowledge cutoff" (they only know what they were trained on) and a limited "context window" (they can only read a certain amount of text at once). You cannot just dump 10,000 documents into an LLM prompt.

The solution is Retrieval-Augmented Generation (RAG):

  1. At ingestion time: Process and store your documents in a searchable format
  2. At query time: Retrieve only the most relevant pieces and pass them to the LLM

This means the LLM only needs to read 5-10 document snippets at a time instead of 10,000 full documents. Retrieval solves the "needle in a haystack" problem efficiently.


Part 1: Data Ingestion Techniques

Ingestion is everything that happens before a user asks a question. You are preparing your data so it can be found quickly and accurately later.


3.1 Raw Document Ingestion

The first step is simply getting your data into the pipeline. Documents come in many formats and each requires different handling.

Document Loaders

Document loaders are components that read raw files and produce plain text (plus metadata like filename, page number, author).

| Source Type | Examples | Tool/Approach |
| PDFs | Policy docs, contracts, reports | PyMuPDF, pdfplumber, Unstructured.io |
| Word (.docx) | HR policies, proposals | python-docx, Unstructured.io |
| Web pages | Product docs, Wikipedia | BeautifulSoup, Playwright, Firecrawl |
| Databases | SQL tables, NoSQL records | SQLAlchemy connectors |
| APIs | Confluence, Notion, Salesforce | LangChain / LlamaIndex integrations |
| Code | GitHub repositories | AST parsers, tree-sitter |
| Audio/Video | Meetings, podcasts | Whisper (speech-to-text) then text |
| Spreadsheets | Excel, CSV | pandas, openpyxl |
| Email | Outlook, Gmail threads | MIME parsers |

Example: Ingesting a PDF Policy Document

from llama_index.readers.file import PDFReader

loader = PDFReader()
documents = loader.load_data(file="bank_dispute_policy.pdf")

# Result: A list of Document objects, one per page
# Each Document has:
#   - text: "To dispute a foreign transaction, the customer must..."
#   - metadata: {"file_name": "bank_dispute_policy.pdf", "page_label": "4"}

Key Challenge: Noisy Input

Raw documents are messy. PDFs often contain:

  • Headers and footers that repeat on every page ("CONFIDENTIAL | PAGE 4")
  • Tables that extract as garbled text
  • Scanned images that contain no machine-readable text

Solutions:

  • Use OCR (Optical Character Recognition) tools like Tesseract for scanned documents
  • Use Unstructured.io which handles layout-aware extraction for complex PDFs
  • Pre-process to strip boilerplate (page numbers, legal disclaimers that appear on every page)

3.2 Chunking Strategies

After loading, you need to split documents into smaller pieces called chunks. You do this because:

  1. An LLM's context window is limited (you can only feed it so much text)
  2. Embedding models (covered next) work best on focused, short passages
  3. Retrieval precision improves when chunks are topically focused

Think of chunking like cutting a textbook into flash cards. Each flash card covers one concept clearly.

Strategy 1: Fixed-Size Chunking

Split every N characters or tokens, regardless of content.

Input: "The customer must submit Form A. Then the team reviews it. Approval takes 5 days."

Chunk 1: "The customer must submit Form A. Then the team reviews"
Chunk 2: " it. Approval takes 5 days."

Problem: Chunks can split mid-sentence, losing meaning. Chunk 2 starts with "it" but "it" refers to something in Chunk 1.

Fix: Use overlap. Repeat 10-20% of the previous chunk at the start of the next.

Chunk 1: "The customer must submit Form A. Then the team reviews it."
Chunk 2: "Then the team reviews it. Approval takes 5 days."  ← overlaps

This ensures context is not lost across boundaries.
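
As a rough sketch, fixed-size chunking with overlap fits in a few lines:

def fixed_size_chunks(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    # Step forward by (size - overlap) so each chunk repeats
    # the tail of the previous one
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]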

Strategy 2: Recursive Character Splitting

Split by natural boundaries in this priority order: paragraph breaks, then sentence breaks, then word breaks. This is the most common default approach.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # max characters per chunk
    chunk_overlap=50,     # overlap between chunks
    separators=["\n\n", "\n", ". ", " "]  # tries these in order
)
chunks = splitter.split_text(document_text)

The splitter first tries to break on double newlines (paragraph breaks). If a piece is still too big, it falls back to single newlines (line breaks), then sentence boundaries (". "), then spaces between words.

Strategy 3: Semantic Chunking

Instead of splitting at fixed sizes, split at semantic boundaries -- where the topic actually changes.

How it works:

  1. Split text into sentences
  2. Embed each sentence (convert to a vector, explained in section 3.3)
  3. Compute similarity between adjacent sentences
  4. When similarity drops sharply, it indicates a topic change -- cut there

Sentence 1: "To dispute a charge, call 1800-XXX."         ← topic: disputes
Sentence 2: "You can also use our mobile app."             ← topic: disputes
Sentence 3: "Our savings accounts offer 3.5% interest."    ← TOPIC CHANGE
Sentence 4: "Rates are reviewed quarterly."                ← topic: savings

Semantic chunking would produce one chunk for sentences 1-2 and another for 3-4. This is far more natural than cutting at character 500.

Tool: LangChain's SemanticChunker or LlamaIndex's SemanticSplitterNodeParser.
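
A minimal sketch of the mechanism, assuming the embed() helper from section 3.3 returns unit-length vectors (true of OpenAI embeddings, so a plain dot product equals cosine similarity); the 0.75 threshold is illustrative -- production implementations typically compare rolling windows and derive the threshold from a percentile of observed similarities:

import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    vectors = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(vectors[i - 1], vectors[i])) < threshold:  # topic change
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks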

Strategy 4: Document Structure-Aware Chunking

Use the document's own structure (headings, sections, paragraphs) as boundaries.

# Section: Dispute Process         <- Heading 1
## Step 1: Submit a request        <- Heading 2
Fill out Form A online...          <- content chunk under Heading 2

## Step 2: Wait for review         <- Heading 2
The team will contact you...       <- content chunk under Heading 2

Each chunk retains which section it came from. This is powerful because you can later filter by section.
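
For markdown-like text, a minimal structure-aware splitter can be sketched in a few lines (LangChain's MarkdownHeaderTextSplitter is a production-grade version of the same idea):

def split_by_headings(markdown: str) -> list[dict]:
    # Group lines under the most recent heading; keep the heading as metadata
    chunks, heading, lines = [], None, []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if lines:
                chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if lines:
        chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
    return chunks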

Strategy 5: Agentic / Proposition-Based Chunking

This is a newer, more sophisticated approach. Use an LLM itself to reformulate each paragraph into a set of standalone, self-contained propositions.

Before:

"It can be submitted by the customer or their representative."

After (LLM-generated propositions):

"A dispute form can be submitted by the customer."
"A dispute form can be submitted by the customer's representative."

Each proposition is a complete, self-contained fact. This dramatically improves retrieval because there is no ambiguity about what "it" refers to.

Choosing Chunk Size: Rules of Thumb

| Use Case | Recommended Chunk Size |
| Factual Q&A (precise answers needed) | 128-256 tokens |
| General document Q&A | 256-512 tokens |
| Summarisation tasks | 512-1024 tokens |
| Code retrieval | Whole function or class |

Always test empirically -- wrong chunk size is one of the top reasons RAG systems fail.


3.3 Embedding: Turning Text into Numbers

This is the most important concept in modern AI retrieval. To search for text by meaning (not just keywords), you need to convert text into vectors -- lists of numbers that capture semantic meaning.

What is an Embedding?

An embedding is a fixed-length list of decimal numbers that represents the "meaning" of a piece of text in a high-dimensional mathematical space.

"How do I dispute a charge?"
→ [0.23, -0.87, 0.45, 0.12, -0.33, ..., 0.67]  (e.g. 1536 numbers)

"Steps to challenge a transaction"
→ [0.21, -0.85, 0.44, 0.14, -0.31, ..., 0.65]  (very similar numbers!)

"What is the weather today?"
→ [-0.55, 0.12, -0.78, 0.93, 0.44, ..., -0.22]  (very different numbers)

Texts with similar meanings produce vectors that are close together in this high-dimensional space. Texts with different meanings produce vectors that are far apart.

This is called the semantic similarity property of embeddings.

How Similarity is Measured

The most common measure is cosine similarity. It measures the angle between two vectors. If the angle is small (vectors point in the same direction), the texts are semantically similar.

Cosine similarity of 1.0  = identical meaning
Cosine similarity of 0.9  = very similar
Cosine similarity of 0.5  = somewhat related
Cosine similarity of 0.0  = completely unrelated
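
In code, cosine similarity is a one-liner: the dot product of the two vectors divided by the product of their lengths.

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))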

Embedding Models

Different models produce embeddings of different quality and dimensions. Here are the most commonly used ones in 2024-2025:

| Model | Provider | Dimensions | Notes |
| text-embedding-3-large | OpenAI | 3072 | Best general quality |
| text-embedding-3-small | OpenAI | 1536 | Fast, cheap, very good |
| embed-english-v3.0 | Cohere | 1024 | Strong, good for enterprise |
| voyage-3 | Voyage AI | 1024 | State of the art for RAG |
| nomic-embed-text | Nomic (open source) | 768 | Free, runs locally |
| bge-large-en-v1.5 | BAAI (open source) | 1024 | Strong open source option |
| mxbai-embed-large | MixedBread (open source) | 1024 | Excellent, can run locally |

Example: Embedding Chunks

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # list of 1536 floats

chunk = "To dispute a foreign transaction, the customer must call within 60 days."
vector = embed(chunk)
# vector = [0.23, -0.12, ..., 0.67]  (1536 numbers)

This vector is what gets stored in a vector database alongside the original text.

Important: Embedding Symmetry

You must embed both documents AND queries using the same model. If you embed documents with model A and queries with model B, similarity scores will be meaningless.

Some models are asymmetric -- they use a different prefix for queries vs documents. For example, with bge-large-en-v1.5:

# Documents are embedded without a prefix
doc_vector = embed("dispute process requires Form A")

# Queries use a special prefix
query_vector = embed("Represent this sentence for searching relevant passages: How do I dispute a charge?")

Always check the model documentation to know if asymmetric embedding is required.


3.4 Metadata Enrichment

When you store a chunk in a vector database, you should always store metadata alongside the vector. Metadata is structured information about the chunk that enables filtering.

Why Metadata Matters

Without metadata, you search all chunks equally. With metadata, you can narrow down:

"Only search in Policy documents from the Retail Banking division, published after 2023."

Types of Metadata to Attach

Source metadata (automatically extractable):

{
  "source_file": "retail_banking_disputes_policy_v3.pdf",
  "page_number": 4,
  "file_type": "pdf",
  "created_date": "2024-01-15",
  "last_modified": "2024-08-01"
}

Structural metadata (from document structure):

{
  "section_heading": "Foreign Transaction Disputes",
  "parent_heading": "Dispute Resolution",
  "document_type": "policy",
  "chapter": 3
}

Semantic metadata (LLM-generated at ingestion time):

{
  "summary": "This chunk explains the 60-day window for disputing foreign charges.",
  "keywords": ["dispute", "foreign transaction", "60 days", "Form A"],
  "entities": ["Form A", "Dispute Resolution Team"],
  "topic": "dispute process",
  "audience": "retail banking customers"
}

Using an LLM to generate summaries and keywords at ingestion time is expensive but significantly improves retrieval quality.

Hypothetical Questions Metadata

A powerful technique: for each chunk, use an LLM to generate 3-5 hypothetical questions that the chunk would answer. Store these as metadata.

Chunk: "To dispute a foreign transaction, the customer must submit Form A within 60 days."

Generated questions:
- "What is the deadline to dispute a foreign charge?"
- "Which form do I use to challenge an overseas transaction?"
- "How long do I have to report an incorrect foreign payment?"

When a user asks one of these questions, you can match against both the chunk text AND the pre-generated questions, dramatically improving recall.
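
A minimal sketch of generating these questions at ingestion time, reusing the OpenAI client from section 3.3 (the prompt wording and the gpt-4o-mini model choice are illustrative, not prescriptive):

def generate_questions(chunk_text: str, n: int = 3) -> list[str]:
    # Ask the LLM for n questions the chunk answers, one per line
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write {n} short questions that the following passage "
                       f"answers, one per line:\n\n{chunk_text}",
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]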


3.5 Multimodal Ingestion (Images, Audio, Tables)

Modern agentic AI is not limited to text. Here is how non-text content is ingested.

Tables

Tables in PDFs are notoriously hard to extract. Options:

Option 1: Convert to text description
Use an LLM or a specialised tool to describe the table in natural language.

Original table:
| Product | Rate | Min Balance |
| Savings | 3.5% | $500        |
| Premium | 4.2% | $10,000     |

Converted to text:
"The Savings account offers a 3.5% rate with a minimum balance of $500. 
The Premium account offers 4.2% with a minimum balance of $10,000."

Option 2: Store as structured JSON and index both the JSON and a text description

Option 3: Use table-aware extraction tools like Unstructured.io or Camelot (for PDFs).

Images

Option 1: Vision LLM captioning
Pass each image through a multimodal LLM (like GPT-4o or Claude) and generate a text caption/description. Store the description as a searchable text chunk.

# At ingestion time, for each image in a document:
caption = vision_llm.describe(image)
# "A bar chart showing monthly transaction disputes from Jan-Dec 2023,
#  with a peak of 1,245 disputes in August."

# Store caption as an embeddable chunk alongside source metadata

Option 2: Multimodal Embeddings
Models like CLIP (originally from OpenAI) and newer multimodal embedding models can embed both text and images in the same vector space. This allows you to search images with text queries directly.

Audio

  1. Transcribe audio to text using Whisper (OpenAI's open-source model) or AWS Transcribe
  2. Add speaker labels and timestamps as metadata
  3. Chunk the transcript by speaker turn or time window
  4. Embed and index like regular text

3.6 Knowledge Graph Ingestion

Rather than storing text chunks in isolation, a Knowledge Graph stores entities and the relationships between them.

Example

From the text "Alice manages the Dispute Resolution Team, which reports to the Operations Division":

A knowledge graph would extract:

  • Entity: Alice (type: Person)
  • Entity: Dispute Resolution Team (type: Team)
  • Entity: Operations Division (type: Division)
  • Relationship: Alice MANAGES Dispute Resolution Team
  • Relationship: Dispute Resolution Team REPORTS_TO Operations Division

This is stored as a graph (nodes and edges) rather than text chunks.

Why This Matters for Agentic AI

Graphs excel at multi-hop queries -- questions that require connecting information across multiple documents.

"Who is ultimately responsible for approving foreign transaction disputes?"

To answer this, an agent might need to traverse:

  • Alice manages the Dispute Resolution Team
  • The team reports to the Operations Division
  • The Operations Division head is Bob
  • Bob's approval is required for disputes over $10,000

A flat text-chunk system would struggle. A graph makes this traversal explicit and efficient.

Tools for Knowledge Graph Ingestion

  • LlamaIndex PropertyGraphIndex: Extracts entities and relations from text using an LLM and stores in a graph
  • Microsoft GraphRAG: Builds community-level summaries and graphs from large corpora
  • Neo4j + LangChain: Store and query property graphs
  • NebulaGraph: Distributed graph database for large-scale deployments

3.7 Ingestion Pipelines and Orchestration

In production, ingestion is not a one-time batch job. Documents are added, updated, and deleted continuously. You need a pipeline.

Pipeline Stages

[Source Documents]
      |
      v
[Document Loader]     <- Read raw files (PDF, DOCX, web, API)
      |
      v
[Pre-processing]      <- Clean noise, extract structure, OCR if needed
      |
      v
[Chunking]            <- Split into appropriately-sized pieces
      |
      v
[Metadata Enrichment] <- Add source info, LLM-generated summaries/questions
      |
      v
[Embedding]           <- Convert each chunk to a vector
      |
      v
[Vector Store]        <- Store vectors + metadata (Pinecone, Weaviate, etc.)
      |
      v
[Ready for Retrieval] <- Queryable index

Change Detection and Incremental Ingestion

You do not want to re-ingest 10,000 documents every time one file changes. Solutions:

  • File hashing: Store an MD5 or SHA hash of each document. Only re-ingest if the hash changes (see the sketch after this list).
  • Timestamp tracking: Only re-ingest files modified since the last pipeline run.
  • Soft deletes: When a document is deleted, mark its chunks as inactive in the vector store rather than hard-deleting (for audit trails).
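
A minimal sketch of hash-based change detection, as referenced above (the previous_hashes store is an assumption -- for example a small metadata table persisted between pipeline runs):

import hashlib

def file_hash(path: str) -> str:
    # SHA-256 of the raw bytes; any edit to the file changes the hash
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_reingest(path: str, previous_hashes: dict[str, str]) -> bool:
    # previous_hashes maps path -> hash recorded on the last run
    return previous_hashes.get(path) != file_hash(path)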

Popular Ingestion Frameworks

| Framework | Description |
| LangChain | General-purpose, massive ecosystem of loaders and splitters |
| LlamaIndex | Optimised specifically for RAG and indexing |
| Unstructured.io | Best-in-class for complex document parsing |
| Haystack | Production-ready ML pipelines |
| Apache Airflow | General workflow orchestration (pairs well with the above) |

Part 2: Data Retrieval Techniques

Retrieval is what happens at query time, when a user asks a question. The goal: find the most relevant chunks from your indexed store to give the LLM the right context.


4.1 Dense Retrieval (Semantic / Vector Search)

Dense retrieval is the foundation of modern RAG. It reuses the same embedding approach introduced during ingestion.

How It Works

  1. User asks: "What is the deadline to dispute a foreign transaction?"
  2. Convert the query to a vector using the same embedding model used at ingestion
  3. Search the vector database for the chunks whose vectors are closest to the query vector
  4. Return the top-K most similar chunks (typically K=3 to K=10)

query = "What is the deadline to dispute a foreign transaction?"
query_vector = embed(query)  # [0.21, -0.85, 0.44, ...]

# Search vector database
results = vector_db.query(
    vector=query_vector,
    top_k=5,                # Return top 5 results
    include_metadata=True
)

# results contains the 5 most semantically similar chunks
for result in results:
    print(result.score)      # e.g. 0.92 (cosine similarity)
    print(result.text)       # "To dispute a foreign transaction, submit Form A within 60 days..."
    print(result.metadata)   # {"source": "policy.pdf", "page": 4}

Strengths

  • Finds semantic matches even when exact words differ ("deadline" matches "time limit")
  • Works across paraphrases, synonyms, different phrasings
  • Handles multilingual queries well with multilingual embedding models

Weaknesses

  • Can miss exact keyword matches (searching "Form A" might not return the exact form reference)
  • Performance degrades for highly technical, domain-specific terminology not well-represented in training data
  • Requires an embedding model and vector store infrastructure

4.2 Sparse Retrieval (Keyword Search / BM25)

Sparse retrieval is the classic approach used by search engines for decades. It is based on matching keywords, not semantic meaning.

BM25 (Best Match 25)

BM25 is the industry standard sparse retrieval algorithm. It ranks documents based on:

  • Term frequency: How often does the query word appear in the document?
  • Inverse document frequency: Is the word rare (high value) or common (low value)?
  • Document length normalisation: Longer documents should not be unfairly favoured

Query: "Form A dispute process"

BM25 scores each chunk based on how many query words appear,
how rare those words are across all chunks, and chunk length.

Chunk 1: "Submit Form A within 60 days..."         → score: 8.4 (high -- "Form A" is specific)
Chunk 2: "The dispute process begins with..."      → score: 6.2 (matches "dispute process")
Chunk 3: "Contact us for any banking needs..."     → score: 0.1 (no keywords match)

Strengths

  • Excellent for exact keyword matches (product codes, names, form numbers)
  • Fast, no GPU required
  • Interpretable (you know exactly why a result was returned)
  • Works well for domain-specific jargon

Weaknesses

  • Cannot handle synonyms or paraphrases (will miss "deadline" if the doc says "time limit")
  • Sensitive to typos
  • Requires users to use the "right" vocabulary

Tools

  • OpenSearch / Elasticsearch (industrial-scale BM25 with many enhancements)
  • Apache Lucene (underlying engine behind Elasticsearch)
  • BM25s (fast Python implementation)
  • Weaviate, Qdrant, and most vector databases now include BM25 as a built-in option
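
For a quick local experiment, the open-source rank_bm25 package provides BM25Okapi (the whitespace tokenisation below is deliberately naive; real systems use proper analysers):

from rank_bm25 import BM25Okapi

chunks = [
    "Submit Form A within 60 days to dispute foreign transactions.",
    "The dispute process begins with a phone call to the branch.",
    "Contact us for any banking needs.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])
scores = bm25.get_scores("form a dispute process".split())
# One BM25 score per chunk; the last chunk scores 0.0 because
# none of the query terms appear in it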

4.3 Hybrid Retrieval

Since dense (semantic) and sparse (keyword) retrieval each have complementary strengths, modern systems combine both. This is called hybrid retrieval and it consistently outperforms either approach alone.

How It Works

  1. Run the query through both a vector search and a BM25 search in parallel
  2. Get two separate ranked result lists
  3. Fuse (combine) the two lists into a single ranked list
  4. Return the top results

Dense results (semantic):
  1. "The time limit for challenging international charges is 60 days." (score: 0.91)
  2. "Customers may contest foreign payments by filling in Form A." (score: 0.88)
  3. "Dispute windows close two months after the transaction date." (score: 0.85)

Sparse results (BM25):
  1. "Submit Form A within 60 days to dispute foreign transactions." (score: 9.2)
  2. "Form A is required for all foreign dispute submissions." (score: 7.8)
  3. "See the dispute policy for Form A submission guidelines." (score: 6.1)

After fusion:
  1. "Submit Form A within 60 days..." (appeared in BOTH lists -- highest confidence)
  2. "Customers may contest foreign payments..." (top semantic match)
  3. "Form A is required for all foreign dispute submissions." (strong keyword match)

Reciprocal Rank Fusion (RRF)

RRF is the most common fusion algorithm. For each result in each list, it assigns a score based on rank position, then sums scores across lists.

RRF score for a document d = sum over each list of: 1 / (k + rank(d))
where k is a constant (typically 60)

A document ranked #1 in both lists scores much higher than one ranked #5 in one list and absent in another. This simple formula is robust and works well in practice.

# Pseudocode for RRF
def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    scores = {}
    for rank, doc in enumerate(dense_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(sparse_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Tools Supporting Hybrid Retrieval

  • Weaviate: Native hybrid search with RRF
  • Qdrant: Built-in sparse + dense fusion
  • Elasticsearch: Reciprocal rank fusion support
  • LangChain EnsembleRetriever: Combines multiple retrievers with RRF
  • Pinecone: Hybrid search with sparse-dense index

4.4 Reranking

After retrieval, you have a shortlist of candidate chunks (e.g. the top 20). Reranking is a second, more expensive scoring step that re-orders them to pick the best 5.

Why Not Just Use the Top-K from Retrieval?

First-stage retrieval (vector search or BM25) is fast but approximate. It finds "probably relevant" documents. A reranker does a deeper, more accurate relevance assessment.

Analogy: think of retrieval as a recruiter shortlisting 20 CVs, and reranking as the hiring manager picking the top 5 for interview. The recruiter is fast but may miss nuance. The manager takes more time but makes better decisions.

Cross-Encoder Rerankers

Cross-encoders take the query and each candidate document together as input and output a single relevance score. Unlike embeddings (which encode query and document separately), cross-encoders see both simultaneously, enabling richer comparison.

Input to cross-encoder:
  [CLS] What is the deadline to dispute a foreign transaction? [SEP] 
  Submit Form A within 60 days to dispute foreign transactions. [SEP]

Output: 0.97 (very relevant)

Input:
  [CLS] What is the deadline to dispute a foreign transaction? [SEP]
  Contact us for any banking needs. [SEP]

Output: 0.04 (not relevant)

Popular Reranker Models

| Model | Provider | Notes |
| rerank-english-v3.0 | Cohere | Excellent, API-based |
| rerank-3.5 | Cohere | Latest, best quality |
| bge-reranker-large | BAAI | Open source, runs locally |
| ms-marco-MiniLM | Microsoft | Fast, good quality |
| Jina Reranker v2 | Jina AI | Multilingual support |
| RankLLM | Various | Uses an LLM itself as a reranker |

Typical Retrieval + Reranking Pipeline

# Step 1: Fast retrieval -- get top 20 candidates
candidates = vector_db.query(query_vector, top_k=20)

# Step 2: Rerank -- score each candidate against the query
import cohere

co = cohere.Client(api_key)
reranked = co.rerank(
    query="What is the deadline to dispute a foreign transaction?",
    documents=[c.text for c in candidates],
    model="rerank-english-v3.0",
    top_n=5  # Keep only top 5
)

# reranked.results contains the 5 most relevant chunks

The typical pattern is: retrieve 20-50 with fast vector search, then rerank to top 5-10. This balances speed and accuracy.


4.5 Query Transformation Techniques

The user's original query is often not the best query for retrieval. Query transformation improves retrieval by modifying or expanding the query before searching.

Technique 1: Query Rewriting

Use an LLM to rewrite the query into a more explicit form that retrieval can handle better.

Original query: "What happens next after I send the form?"

Rewritten query: "What is the process after submitting Form A for a foreign transaction dispute? 
What steps follow the submission and what is the expected timeline?"

The rewritten query is more specific, uses more keywords, and is more likely to match relevant chunks.

Technique 2: Step-Back Prompting

When a query is very specific, sometimes you need to retrieve context at a higher level of abstraction first.

Specific query: "What happens if I dispute a charge made on 3rd March but the 60-day deadline was 2nd March?"

Step-back query: "What are the exceptions and grace periods in the foreign transaction dispute policy?"

The step-back query retrieves the broader policy context, which likely contains information about exceptions, which then informs answering the specific question.

Technique 3: HyDE (Hypothetical Document Embeddings)

A clever approach: use an LLM to generate a hypothetical answer to the query, then embed and search using that hypothetical answer instead of the original query.

Query: "What is the dispute deadline for foreign transactions?"

Step 1 -- LLM generates a hypothetical answer:
"The dispute deadline for foreign transactions is typically 60 days from the transaction date. 
Customers must submit Form A with supporting documentation..."

Step 2 -- Embed the hypothetical answer (not the original query)

Step 3 -- Search for chunks similar to this hypothetical answer

Why does this work? The hypothetical answer is in "document language" (the kind of text that would appear in the policy), so it aligns better with how the real answer is written in the policy document.

HyDE is covered in more depth in section 5.3.

Technique 4: Multi-Query Generation

Generate multiple variations of the original query and retrieve for each. Merge and deduplicate results.

Original query: "dispute foreign transaction"

Generated variants:
1. "How do I challenge an international charge on my account?"
2. "What is the process for contesting a foreign payment?"
3. "Steps to dispute an overseas transaction"
4. "Foreign transaction dispute form and deadline"

Retrieve for all 4 queries, merge results, deduplicate

This is sometimes called query expansion and ensures that wording variations do not cause you to miss relevant documents.

# LangChain MultiQueryRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm
)
results = retriever.invoke("dispute foreign transaction")

Technique 5: Decomposition (Sub-Question Generation)

For complex, multi-part queries, decompose into simpler sub-questions, retrieve for each, then synthesise.

Complex query: "Which division handles disputes, how long do they take, and who approves them?"

Decomposed:
1. "Which division handles foreign transaction disputes?"
2. "How long does the dispute resolution process take?"
3. "Who has approval authority for disputes?"

Retrieve for each sub-question separately.
Combine the 3 sets of retrieved chunks.
LLM synthesises a unified answer from all retrieved context.

4.6 Contextual Compression and Filtering

Even after retrieval, chunks may contain irrelevant sentences mixed with relevant ones. Contextual compression extracts only the relevant parts.

How It Works

Retrieved chunk (full text):
"Our bank was founded in 1924. We offer personal and business banking solutions.
To dispute a foreign transaction, submit Form A within 60 days.
We also offer home loans and investment products.
Our customer service team is available 24/7."

After contextual compression (only relevant to the dispute query):
"To dispute a foreign transaction, submit Form A within 60 days."

The compressed chunk is smaller, more focused, and uses less of the LLM's context window.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever()
)

Metadata Filtering

Before or during retrieval, apply hard filters on metadata to narrow the search space.

# Only search in documents from the "Retail Banking" division
# AND published after 2023-01-01
results = vector_db.query(
    vector=query_vector,
    top_k=5,
    filter={
        "division": "Retail Banking",
        "published_after": "2023-01-01",
        "document_status": "active"
    }
)

This is equivalent to SQL's WHERE clause applied to vector search.


4.7 Agentic / Multi-Step Retrieval

This is where retrieval becomes truly agentic. Instead of a single retrieve-then-generate step, an agent iteratively retrieves, reasons, and decides whether to retrieve more.

Iterative Retrieval (ReAct Pattern)

The agent follows a Reason-Act-Observe loop:

User: "Who approves disputes over $10,000 and how should I contact them?"

Agent loop:

THOUGHT 1: "I need to find who approves disputes over $10,000."
ACTION 1: Search("approval authority for disputes over $10,000")
OBSERVATION 1: "Disputes over $10,000 require sign-off from the Operations Director."

THOUGHT 2: "Now I need contact details for the Operations Director."
ACTION 2: Search("Operations Director contact details")
OBSERVATION 2: "Operations Director: Sarah Chen, sarah.chen@bank.com, ext. 4521"

THOUGHT 3: "I have enough information to answer fully."
FINAL ANSWER: "Disputes over $10,000 are approved by the Operations Director, 
Sarah Chen. She can be reached at sarah.chen@bank.com or extension 4521."

Each retrieval is informed by the result of the previous one. This is fundamentally different from a single-step RAG system.
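
A minimal sketch of this loop. Here llm() and search() are assumed helpers (a text-completion call and a retrieval call), and the action parsing is deliberately simplistic:

def react_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(
            transcript
            + "Respond with a THOUGHT, then either ACTION: Search(<query>) "
              "or FINAL ANSWER: <answer>."
        )
        transcript += step + "\n"
        if "FINAL ANSWER:" in step:
            return step.split("FINAL ANSWER:", 1)[1].strip()
        if "Search(" in step:
            query = step.split("Search(", 1)[1].split(")", 1)[0]
            transcript += f"OBSERVATION: {search(query)}\n"  # feed result back
    return "No answer found within the step limit."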

Self-Querying Retrieval

The agent automatically constructs metadata filters from natural language.

User query: "What disputes policies were updated last year in the retail division?"

Self-querying extracts:
- Semantic search query: "disputes policies"
- Metadata filters: 
    division = "retail"
    updated_date >= "2024-01-01" AND updated_date <= "2024-12-31"

The LLM parses the natural language query into both a semantic component and structured filters, then executes a filtered vector search.
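
A framework-free sketch of the idea, reusing embed() and vector_db from earlier sections; the llm() helper and the JSON plan format are assumptions (LangChain's SelfQueryRetriever implements this pattern over many vector stores):

import json

def self_query(user_query: str, top_k: int = 5):
    # Ask the LLM to split the question into a search string plus filters
    plan = json.loads(llm(
        "From the user query below, return JSON with two keys: "
        '"query" (a short semantic search string) and "filters" '
        "(metadata filters as key-value pairs).\n\n" + user_query
    ))
    return vector_db.query(
        vector=embed(plan["query"]),
        top_k=top_k,
        filter=plan["filters"],
    )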

Parent-Child Retrieval

A two-level chunking strategy that improves context quality:

  • Child chunks: Small, precise chunks (128 tokens) used for embedding and retrieval. Small size = high precision in matching.
  • Parent chunks: Larger chunks (512 tokens) returned to the LLM after a child chunk is matched. Larger size = more context for the LLM to reason over.

Parent chunk (what LLM sees):
"[Section 4: Foreign Transactions]
All foreign transaction disputes must be initiated within 60 days of the transaction date.
The customer must complete Form A, available at any branch or online at bank.com/forms.
Supporting documentation such as bank statements should be attached.
Disputes initiated after 60 days will be declined except in cases of fraud or system error..."

Child chunk (what was matched by vector search):
"The customer must complete Form A, available at any branch or online at bank.com/forms."

The child chunk matched the query precisely. But the LLM receives the full parent section as context, giving it the surrounding information needed to reason correctly.
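
On the retrieval side the pattern is a small indirection. In this sketch, child_index, child_to_parent and parent_chunks are assumed to have been built at ingestion time:

def retrieve_with_parents(query_vector, top_k: int = 5):
    # Match against the small, precise child chunks...
    hits = child_index.query(vector=query_vector, top_k=top_k)
    parent_ids = []
    for hit in hits:
        pid = child_to_parent[hit.id]
        if pid not in parent_ids:      # deduplicate children sharing a parent
            parent_ids.append(pid)
    # ...but hand the larger parent sections to the LLM
    return [parent_chunks[pid] for pid in parent_ids]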


4.8 Structured Data Retrieval (Text-to-SQL)

Not all data is in documents. Often the most important data lives in databases. Agentic AI can query structured databases using natural language through Text-to-SQL.

How Text-to-SQL Works

User: "How many disputes were filed in March 2024 and what was the average resolution time?"

Step 1 -- Agent retrieves database schema:
Tables: disputes(id, customer_id, date_filed, date_resolved, amount, status, type)
        customers(id, name, account_type, region)

Step 2 -- LLM generates SQL:
SELECT 
  COUNT(*) as total_disputes,
  AVG(DATEDIFF(date_resolved, date_filed)) as avg_resolution_days
FROM disputes
WHERE date_filed >= '2024-03-01' AND date_filed < '2024-04-01'
  AND date_resolved IS NOT NULL;

Step 3 -- Execute SQL, return result:
{"total_disputes": 347, "avg_resolution_days": 8.2}

Step 4 -- LLM formats natural language answer:
"In March 2024, there were 347 disputes filed. The average resolution time was 8.2 days."

Making Text-to-SQL Reliable

Text-to-SQL fails silently -- wrong SQL returns wrong numbers with no error. Mitigation strategies (a small guardrail sketch follows this list):

  • Schema enrichment: Add rich descriptions to each column (not just amount, but amount: The disputed transaction amount in AUD, always positive)
  • Few-shot examples: Include 5-10 example query-SQL pairs in the prompt to guide the LLM
  • SQL validation: Run EXPLAIN before EXECUTE to catch syntax errors
  • Row-limit safety: Always append LIMIT 100 to prevent full table scans
  • Read-only connections: Never give the agent a connection with INSERT/UPDATE/DELETE rights
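
A minimal guardrail sketch, using sqlite3 purely for illustration (a production system would also connect as a read-only database user):

import sqlite3

def run_safe_sql(conn: sqlite3.Connection, sql: str, row_limit: int = 100):
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    conn.execute(f"EXPLAIN {sql}")  # compiles the query, catching syntax errors
    # Wrap in a subquery so the row limit always applies
    wrapped = f"SELECT * FROM ({sql.rstrip(';')}) LIMIT {row_limit}"
    return conn.execute(wrapped).fetchall()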

Tools

  • LangChain SQLDatabaseChain: Simple Text-to-SQL integration
  • LlamaIndex NLSQLTableQueryEngine: More sophisticated with schema awareness
  • Vanna.ai: Trains a Text-to-SQL model on your specific database
  • DSPy: Programmatic prompt optimisation for Text-to-SQL

4.9 Knowledge Graph Retrieval

For queries requiring multi-hop reasoning, knowledge graph retrieval traverses relationships between entities.

Graph Traversal Retrieval

Query: "Which team handles disputes and who manages them?"

Graph traversal:
1. Start at entity: "foreign transaction dispute" (matches query)
2. Traverse: dispute HANDLED_BY "Dispute Resolution Team"
3. Traverse: "Dispute Resolution Team" MANAGED_BY "Alice Johnson"

Answer: "Foreign transaction disputes are handled by the Dispute Resolution Team, 
         managed by Alice Johnson."

This multi-hop traversal would be very difficult with flat text retrieval, as the relationship might not be stated explicitly in a single chunk.
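
A toy sketch of the traversal over an in-memory graph. A real system uses a graph database, but the walk is the same idea:

# Each entity maps to a list of (relation, target) pairs
graph = {
    "foreign transaction dispute": [("HANDLED_BY", "Dispute Resolution Team")],
    "Dispute Resolution Team": [("MANAGED_BY", "Alice Johnson")],
}

def traverse(entity: str, max_hops: int = 2) -> list[tuple[str, str, str]]:
    # Breadth-first walk collecting (source, relation, target) facts
    facts, frontier = [], [entity]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for relation, target in graph.get(node, []):
                facts.append((node, relation, target))
                next_frontier.append(target)
        frontier = next_frontier
    return facts

# traverse("foreign transaction dispute") yields both hops:
# [("foreign transaction dispute", "HANDLED_BY", "Dispute Resolution Team"),
#  ("Dispute Resolution Team", "MANAGED_BY", "Alice Johnson")]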

Cypher Query Generation (Neo4j)

Similar to Text-to-SQL, an LLM can generate graph query language.

User: "Find all policies that Alice Johnson is responsible for."

LLM generates Cypher:
MATCH (p:Person {name: "Alice Johnson"})-[:RESPONSIBLE_FOR]->(doc:Document)
RETURN doc.title, doc.last_updated

Result: [
  {"title": "Foreign Transaction Dispute Policy", "last_updated": "2024-01-15"},
  {"title": "Chargeback Guidelines", "last_updated": "2023-11-20"}
]

Part 3: Advanced Techniques


5.1 RAG Variants

RAG has evolved from a simple "retrieve then generate" pattern into a family of increasingly sophisticated architectures.

Naive RAG (The Baseline)

Query → Embed → Vector Search → Top-K Chunks → LLM → Answer

Simple pipeline. Works for straightforward Q&A but has clear limitations:

  • One retrieval step (no iteration)
  • No query transformation
  • No reranking
  • Poor handling of complex, multi-part questions

Advanced RAG

Adds improvements at every stage:

  • Pre-retrieval: query rewriting, query decomposition
  • Retrieval: hybrid search (dense + sparse), metadata filtering
  • Post-retrieval: reranking, contextual compression

Still a single-pass pipeline but much higher quality.

Modular RAG

Treats each component (retriever, reranker, generator, memory) as a swappable module. You compose a pipeline from modules based on the task.

For a simple FAQ: Retriever → Generator
For complex research: QueryDecomposer → MultiRetriever → Reranker → Synthesiser → Generator

Agentic RAG

The LLM actively controls the retrieval process. It decides:

  • Whether to retrieve at all (maybe it already knows the answer)
  • What to retrieve (formulates its own queries)
  • Whether the retrieved information is sufficient (or whether to search again)
  • Which tool to use (vector search, SQL, web search, knowledge graph)

This is the state of the art and is what production agentic systems look like.


5.2 RAPTOR and Hierarchical Indexing

RAPTOR (Recursive Abstractive Processing for Tree-Organised Retrieval) solves a key limitation: questions that require synthesising information spread across many documents cannot be answered by any single retrieved chunk.

How RAPTOR Works

  1. Chunk all documents normally (leaf nodes)
  2. Cluster similar chunks together using unsupervised clustering (e.g. UMAP + Gaussian Mixture Models)
  3. For each cluster, use an LLM to write a summary of the cluster
  4. Treat these summaries as new "higher-level" documents
  5. Cluster and summarise again (recursively)
  6. Build a tree from leaf chunks to high-level summaries

Level 3 (root):  "Bank dispute policy summary: disputes require Form A, 60-day window, 
                  approved by Operations Director for amounts > $10,000"
                         /                    \
Level 2:    "Foreign transaction       "Domestic dispute
             disputes section"          procedure section"
            /          \                     |
Level 1:  Chunk A    Chunk B             Chunk C
          (60 days)  (Form A)            (domestic steps)

At query time, retrieve from all levels. Broad questions retrieve high-level summaries. Specific questions retrieve leaf chunks. This gives the best of both worlds.


5.3 HyDE (Hypothetical Document Embeddings)

A simple but powerful technique. The insight: a query like "dispute deadline?" is phrased very differently from the policy document that answers it. The semantic gap can cause misses.

The Technique

Step 1 -- Original query:
"What is the deadline to dispute a foreign transaction?"

Step 2 -- LLM generates a hypothetical answer:
"Foreign transaction disputes must be initiated within 60 days of the charge date. 
The customer is required to complete Form A and provide supporting documentation."

Step 3 -- Embed the hypothetical answer (not the original query)

Step 4 -- Retrieve based on the hypothetical embedding

Why does this work? The hypothetical answer uses the same vocabulary and style as the real policy document. Its embedding is much closer to the real answer's embedding than the original short query would be.

The actual hypothetical answer does not need to be correct -- it just needs to be in the right "vector neighbourhood".
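
A minimal sketch, reusing the OpenAI client, embed() and vector_db from earlier sections (the prompt wording is illustrative):

def hyde_search(query: str, top_k: int = 5):
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content
    # Embed the hypothetical answer, not the original query
    return vector_db.query(vector=embed(hypothetical), top_k=top_k)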


5.4 Self-RAG and Corrective RAG (CRAG)

These techniques make the RAG system self-aware -- able to evaluate its own retrieval quality and correct errors.

Self-RAG

Trains a special LLM that generates reflection tokens alongside its output. These tokens express:

  • [Retrieve]: Should I retrieve information at all?
  • [IsRel]: Is this retrieved document relevant?
  • [IsSup]: Is my generated statement supported by the retrieved document?
  • [IsUse]: Is this response useful to the user?

This allows the model to skip retrieval when it already knows the answer, and to flag when its output is not grounded in retrieved evidence.

Corrective RAG (CRAG)

CRAG adds an evaluator that scores retrieved documents for relevance. If documents score poorly:

  1. Triggered: The system detects retrieval quality is low
  2. Web search fallback: Falls back to web search to find better information
  3. Knowledge refinement: Strips irrelevant content from retrieved documents before passing to LLM

Retrieval → Relevance Evaluator
                |
    -------------------------
    |                       |
  High quality          Low quality
    |                       |
  Proceed              Web search fallback
                            |
                       Refine + Proceed

5.5 GraphRAG

GraphRAG, developed by Microsoft Research (2024), addresses the fundamental limitation of chunk-based RAG: it cannot answer questions about broad themes, global summaries, or patterns across an entire document corpus.

The Problem It Solves

Standard RAG cannot answer: "What are the top recurring themes in our dispute policy across all documents?"

No single chunk contains this answer. You would need to read everything.

How GraphRAG Works

Phase 1: Indexing (expensive, done once)

  1. Divide document corpus into chunks
  2. Use an LLM to extract entities (people, organisations, concepts) and their relationships from each chunk
  3. Build a knowledge graph
  4. Detect communities (clusters of related entities) using graph algorithms (Leiden algorithm)
  5. For each community, generate a community summary using an LLM

Phase 2: Query

There are two query modes:

  • Global search: Summarise across all community summaries to answer high-level questions about the entire corpus
  • Local search: Navigate the graph from relevant entities to find specific answers

Global query: "What are the main themes in our banking policies?"
→ LLM reads all community summaries
→ Synthesises themes across summaries
→ Answer: "The main themes are: dispute resolution (60+ docs), 
           customer verification (40+ docs), interest rate management (35+ docs)"

Local query: "Who handles foreign transaction disputes?"
→ Start from "foreign transaction" entity
→ Traverse graph to "Dispute Resolution Team" entity
→ Find related community summary
→ Answer: "The Dispute Resolution Team, managed by Alice Johnson..."

GraphRAG vs Standard RAG

| Aspect | Standard RAG | GraphRAG |
| Best for | Specific factual Q&A | Broad thematic questions |
| Cost | Low | High (LLM used at index time) |
| Global questions | Poor | Excellent |
| Specific lookups | Excellent | Good |
| Index size | Small | Large |

5.6 Long-Context Strategies and Lost-in-the-Middle Problem

LLMs now support context windows of 128K-2M tokens. Should you just stuff everything in and skip retrieval?

The Lost-in-the-Middle Problem

Research (Liu et al., 2023) found that LLMs are significantly worse at using information placed in the middle of a long context versus at the beginning or end. In a 20-document context, LLMs use document 1 and document 20 well, but document 10 is often ignored.

Strategies to Combat This

Strategy 1: Retrieval before long context
Use retrieval to select the 5-10 most relevant documents, then pass only those to the LLM. You avoid stuffing irrelevant content that confuses the model.

Strategy 2: Reorder retrieved documents
Place the most relevant documents at the beginning and end of the context, not in the middle.

# After retrieval and reranking, reorder documents:
# put the most relevant at the start and end of the context,
# and the least relevant in the middle.

def reorder_for_position_bias(docs: list) -> list:
    # docs are assumed sorted most-relevant first
    front = docs[0::2]         # 1st, 3rd, 5th ... most relevant stay near the front
    back = docs[1::2][::-1]    # 2nd, 4th ... most relevant end up near the back
    return front + back

# Example: [d1, d2, d3, d4, d5] (sorted by relevance)
# becomes  [d1, d3, d5, d4, d2] -- best docs at both edges, weakest in the middle

Strategy 3: Long-context + RAG hybrid
Use RAG to retrieve top-20 documents, but then pass all 20 to a long-context model. You get precision from retrieval but recover from any misses with the long context.


Part 4: Vector Databases and Storage Architecture

A vector database is purpose-built for storing embeddings and performing similarity search efficiently at scale.

Core Concepts

Searching all N vectors for the closest one (brute force) is O(N) -- too slow for millions of vectors. Vector databases use Approximate Nearest Neighbour (ANN) algorithms that trade a small amount of accuracy for massive speed gains.

Popular ANN algorithms:

| Algorithm | Description | Used In |
| HNSW (Hierarchical Navigable Small World) | Graph-based, very fast, high accuracy | Weaviate, Qdrant, pgvector |
| IVF (Inverted File Index) | Clusters vectors, searches only nearest clusters | FAISS, Pinecone |
| LSH (Locality-Sensitive Hashing) | Hash-based, very fast, lower accuracy | Older systems |
| DiskANN | Disk-based, handles billion-scale vectors | Azure AI Search |

Major Vector Database Options

| Database | Type | Best For |
| Pinecone | Managed cloud | Fast startup, production scale, no infrastructure |
| Weaviate | Open source / cloud | Rich features, hybrid search, GraphQL API |
| Qdrant | Open source / cloud | Performance, filtering, Rust-based |
| Chroma | Open source, local | Prototyping, local development |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Milvus | Open source | Large scale, enterprise on-prem |
| Redis Vector | Redis extension | Low latency, teams already using Redis |
| Azure AI Search | Managed cloud | Azure ecosystem, hybrid search |
| OpenSearch | Open source / managed | Teams using OpenSearch/Elasticsearch |

Choosing a Vector Database

  • Prototyping / small scale: Chroma (local) or pgvector (if you have Postgres)
  • Production, fully managed: Pinecone or Weaviate Cloud
  • High performance, self-hosted: Qdrant
  • Hybrid search built-in: Weaviate or Qdrant
  • Azure ecosystem: Azure AI Search

Storage Architecture Patterns

Pattern 1: Separate Vector Store + Document Store

[Vector DB]        [Document Store (S3, MongoDB, etc.)]
  - vector          - original text
  - chunk_id   →    - full document
  - metadata        - metadata

The vector DB stores only vectors and chunk IDs. The full text lives separately. Cheaper, more flexible.

Pattern 2: All-in-One (Vector + Metadata + Text)

Modern vector databases like Weaviate and Qdrant store vectors, metadata, and text together. Simpler, but more expensive at scale.

Pattern 3: Layered Cache

L1: In-memory cache (Redis) -- most recent/frequent queries
L2: Vector database -- all indexed chunks
L3: Document store -- raw original documents

Frequently asked questions are answered from cache instantly. Rare questions go through full retrieval.
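
A minimal sketch of the L1 layer with redis-py, where answer_with_rag() stands in for the full retrieval + generation pipeline (an assumed helper):

import hashlib
import json

import redis

r = redis.Redis()

def cached_answer(query: str, ttl_seconds: int = 3600):
    key = "rag:" + hashlib.sha256(query.lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)               # cache hit: no retrieval needed
    answer = answer_with_rag(query)          # cache miss: full pipeline
    r.set(key, json.dumps(answer), ex=ttl_seconds)
    return answer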


Part 5: End-to-End Architecture Example

Let us walk through a complete real-world example: a bank's internal AI assistant that answers questions from customer service agents.

Scenario

Documents: 5,000 internal policy PDFs, 10 years of compliance guidelines, product manuals, FAQ documents. Total: ~50 million tokens.

Users: 500 customer service agents asking questions like "What is the foreign transaction dispute deadline?" or "What are the eligibility criteria for a premium savings account?"

Ingestion Architecture

[Source Systems]
Confluence, SharePoint, S3 PDFs
         |
         v
[Document Loaders]     (Unstructured.io for complex PDFs, custom loaders for Confluence)
         |
         v
[Pre-processing]       (Remove headers/footers, normalise text, extract tables to JSON)
         |
         v
[Chunking]             (Recursive splitting, 512 tokens, 50-token overlap)
         |
         v
[Metadata Enrichment]  (LLM-generated summary, keywords, hypothetical questions per chunk)
         |
         v
[Embedding]            (text-embedding-3-small, 1536 dimensions, batched at 500 chunks/min)
         |
         v
[Vector Store]         (Weaviate, HNSW index, hybrid search enabled)
         |
         v
[Change Detection]     (File hashing, daily incremental re-index pipeline via Airflow)

Query Architecture

[Agent receives query from customer service agent]
         |
         v
[Query Classification]  (Simple Q&A? Complex multi-step? Needs SQL?)
         |
         v
[Query Transformation]  (Rewrite, decompose if complex)
         |
    ---------
    |       |
    v       v
[Vector  [BM25         <- Hybrid retrieval in parallel
 Search]  Search]
    |       |
    ---------
         |
         v
[Fusion]               (Reciprocal Rank Fusion, top 20 candidates)
         |
         v
[Reranking]            (Cohere rerank-english-v3.0, top 5)
         |
         v
[Contextual Compression] (Extract most relevant sentences from each chunk)
         |
         v
[LLM Generation]       (GPT-4o or Claude, with retrieved context)
         |
         v
[Grounding Check]      (Is the answer supported by sources? Self-RAG evaluation)
         |
         v
[Response + Citations]  (Answer with source document links)

Infrastructure Summary

| Component | Technology |
| Document ingestion | Unstructured.io + custom Python loaders |
| Chunking + pipeline | LlamaIndex |
| Embedding model | text-embedding-3-small (OpenAI) |
| Vector store | Weaviate (self-hosted on GCP) |
| Reranker | Cohere rerank-english-v3.0 |
| LLM | GPT-4o (Azure OpenAI) |
| Orchestration | LangGraph (agentic workflows) |
| Pipeline scheduling | Apache Airflow |
| Caching | Redis (query-level caching) |
| Observability | LangSmith (tracing every retrieval + generation step) |

Part 6: Evaluation and Quality Metrics

You cannot improve what you do not measure. Here are the key metrics for evaluating RAG and agentic retrieval systems.

Retrieval Metrics

Recall@K

"Of all the documents that should have been retrieved, how many were actually retrieved in the top K results?"

Example:
- Documents that should answer the query: [doc_A, doc_B]
- Retrieved top 5: [doc_C, doc_A, doc_D, doc_E, doc_B]
- Both doc_A and doc_B are in top 5

Recall@5 = 2/2 = 1.0 (perfect)

Precision@K

"Of the K documents retrieved, how many were actually relevant?"

Retrieved top 5: [doc_C, doc_A, doc_D, doc_E, doc_B]
Relevant docs in top 5: doc_A and doc_B → 2 out of 5

Precision@5 = 2/5 = 0.4

MRR (Mean Reciprocal Rank)

Measures the rank of the first relevant result, as the reciprocal 1/rank, averaged over all queries. Higher is better.

First relevant result at rank 2 → MRR = 1/2 = 0.5
First relevant result at rank 1 → MRR = 1/1 = 1.0
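
These metrics are a few lines each. A minimal sketch, where relevant is the set of gold document ids and retrieved is the ranked list of ids returned by the system:

def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    return len(relevant & set(retrieved[:k])) / len(relevant)

def precision_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    return len(relevant & set(retrieved[:k])) / k

def reciprocal_rank(relevant: set[str], retrieved: list[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

# Using the example above:
# recall_at_k({"doc_A", "doc_B"}, ["doc_C", "doc_A", "doc_D", "doc_E", "doc_B"], 5) -> 1.0
# precision_at_k({"doc_A", "doc_B"}, ["doc_C", "doc_A", "doc_D", "doc_E", "doc_B"], 5) -> 0.4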

End-to-End RAG Metrics (RAGAS Framework)

RAGAS is the industry-standard framework for evaluating RAG pipelines.

| Metric | What It Measures | How |
| Faithfulness | Is the answer grounded in the retrieved context? | LLM judge |
| Answer Relevancy | Does the answer address the question? | Embedding similarity |
| Context Precision | Are retrieved contexts relevant? | LLM judge |
| Context Recall | Were all necessary contexts retrieved? | LLM + ground truth |
| Answer Correctness | Is the answer factually correct? | LLM + ground truth |

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

score = evaluate(
    dataset=my_eval_dataset,  # contains query, answer, contexts, ground_truth
    metrics=[faithfulness, answer_relevancy, context_recall]
)

# Output:
# faithfulness: 0.92     (92% of claims supported by retrieved context)
# answer_relevancy: 0.88 (88% of answer directly addresses the query)
# context_recall: 0.85   (85% of necessary info was retrieved)

Observability and Tracing

In production, trace every request through the pipeline to debug failures.

Tools:

  • LangSmith: Full tracing of LangChain/LangGraph pipelines
  • Arize Phoenix: Open source, model performance monitoring
  • Langfuse: Open source, supports multiple frameworks
  • Weights and Biases: Experiment tracking + tracing

A trace records: the original query, the transformed queries, which chunks were retrieved (with scores), the reranked order, the full LLM prompt, and the final answer. When something goes wrong, you can inspect the exact retrieval step that caused the failure.


Glossary

| Term | Definition |
| Agent | An AI system that can take multi-step actions using tools to complete a goal |
| ANN (Approximate Nearest Neighbour) | Algorithm for fast approximate similarity search over vectors |
| BM25 | Classic keyword-based retrieval algorithm used in search engines |
| Chunk | A smaller piece of a document, created by splitting during ingestion |
| Chunking | The process of splitting documents into smaller, searchable pieces |
| Context window | The maximum amount of text an LLM can read at once |
| Cosine similarity | A measure of how similar two vectors are (1 = identical, 0 = unrelated) |
| Cross-encoder | A model that scores relevance by reading query and document together |
| Dense retrieval | Semantic search using vector embeddings |
| Embedding | A list of numbers that represents the meaning of text in vector space |
| GraphRAG | A RAG technique that uses knowledge graphs and community summaries |
| Hallucination | When an LLM generates confident but incorrect information |
| HNSW | An efficient graph-based algorithm for approximate nearest neighbour search |
| Hybrid retrieval | Combining dense (vector) and sparse (keyword) retrieval |
| HyDE | Technique that embeds a hypothetical answer to improve retrieval |
| Ingestion | The process of loading, processing, and indexing documents before retrieval |
| Knowledge graph | A graph structure storing entities and their relationships |
| LLM | Large Language Model (e.g. GPT-4, Claude, Gemini) |
| Metadata | Structured information about a chunk (source, date, author, section) |
| Multi-hop query | A query requiring traversal of multiple steps or documents to answer |
| Naive RAG | Basic retrieve-then-generate pipeline without optimisations |
| OCR | Optical Character Recognition -- converting images of text to machine-readable text |
| RAPTOR | Hierarchical indexing technique using recursive clustering and summarisation |
| RAG | Retrieval-Augmented Generation -- grounding LLM responses in retrieved documents |
| RAGAS | A framework for evaluating RAG pipeline quality |
| Reranking | A second-pass scoring step to improve the ordering of retrieved results |
| RRF (Reciprocal Rank Fusion) | Algorithm for combining ranked lists from multiple retrieval methods |
| Self-RAG | A technique where the LLM evaluates its own retrieval and generation quality |
| Sparse retrieval | Keyword-based retrieval (BM25) using inverted indices |
| Vector | A list of numbers representing a point in high-dimensional space |
| Vector database | A database purpose-built for storing and searching embedding vectors |

This document covers the state of the art as of mid-2025. The field moves quickly. Core concepts like chunking, embeddings, hybrid retrieval, and reranking are stable. Specific models and tools evolve rapidly.