Agentic AI: Data Ingestion and Retrieval Techniques
A Comprehensive Lesson from First Principles
Table of Contents
- What is Agentic AI?
- Why Ingestion and Retrieval Matter
- Part 1: Data Ingestion Techniques
- Part 2: Data Retrieval Techniques
- 4.1 Dense Retrieval (Vector Search)
- 4.2 Sparse Retrieval (Keyword Search / BM25)
- 4.3 Hybrid Retrieval
- 4.4 Reranking
- 4.5 Query Transformation Techniques
- 4.6 Contextual Compression and Filtering
- 4.7 Agentic / Multi-Step Retrieval
- 4.8 Structured Data Retrieval (Text-to-SQL)
- 4.9 Knowledge Graph Retrieval
- Part 3: Advanced Techniques
- Part 4: Vector Databases and Storage Architecture
- Part 5: End-to-End Architecture Example
- Part 6: Evaluation and Quality Metrics
- Glossary
Introduction
1. What is Agentic AI?
Before diving into ingestion and retrieval, it helps to understand what "Agentic AI" actually means.
A traditional AI chatbot takes a question and generates a single response from its pre-trained knowledge. That is it. It has no memory of your documents, no ability to look things up, and no ability to take actions.
Agentic AI is different. It is a system where an AI (typically a Large Language Model, or LLM) can:
- Use tools (search the web, query a database, run code, send emails)
- Retrieve knowledge from your own private documents and data
- Take multi-step actions, deciding what to do next at each step
- Remember context across a conversation or workflow
Think of an agent as a capable employee who, rather than only drawing on what they memorised in university, can also reach into a filing cabinet, search a database, run a calculation, and call a colleague, all to answer your question.
Example to Ground This
Imagine a bank has 10,000 internal policy documents. A customer asks: "What is the process for disputing a foreign transaction?"
A traditional LLM would guess or hallucinate an answer. An agentic AI would:
- Search the internal document store for relevant policy documents
- Pull the most relevant chunks of text
- Reason over those chunks
- Produce an accurate answer grounded in real documents
The magic that makes this possible is ingestion (loading and indexing documents) and retrieval (finding the right document chunks at query time).
2. Why Ingestion and Retrieval Matter
LLMs have a fixed "knowledge cutoff" (they only know what they were trained on) and a limited "context window" (they can only read a certain amount of text at once). You cannot just dump 10,000 documents into an LLM prompt.
The solution is Retrieval-Augmented Generation (RAG):
- At ingestion time: Process and store your documents in a searchable format
- At query time: Retrieve only the most relevant pieces and pass them to the LLM
This means the LLM only needs to read 5-10 document snippets at a time instead of 10,000 full documents. Retrieval solves the "needle in a haystack" problem efficiently.
Part 1: Data Ingestion Techniques
Ingestion is everything that happens before a user asks a question. You are preparing your data so it can be found quickly and accurately later.
3.1 Raw Document Ingestion
The first step is simply getting your data into the pipeline. Documents come in many formats and each requires different handling.
Document Loaders
Document loaders are components that read raw files and produce plain text (plus metadata like filename, page number, author).
| Source Type | Examples | Tool/Approach |
|---|---|---|
| PDFs | Policy docs, contracts, reports | PyMuPDF, pdfplumber, Unstructured.io |
| Word (.docx) | HR policies, proposals | python-docx, Unstructured.io |
| Web pages | Product docs, Wikipedia | BeautifulSoup, Playwright, Firecrawl |
| Databases | SQL tables, NoSQL records | SQLAlchemy connectors |
| APIs | Confluence, Notion, Salesforce | LangChain / LlamaIndex integrations |
| Code | GitHub repositories | AST parsers, tree-sitter |
| Audio/Video | Meetings, podcasts | Whisper (speech-to-text), then index as text |
| Spreadsheets | Excel, CSV | pandas, openpyxl |
| Email | Outlook, Gmail threads | MIME parsers |
Example: Ingesting a PDF Policy Document
from llama_index.readers.file import PDFReader
loader = PDFReader()
documents = loader.load_data(file="bank_dispute_policy.pdf")
# Result: A list of Document objects, one per page
# Each Document has:
# - text: "To dispute a foreign transaction, the customer must..."
# - metadata: {"file_name": "bank_dispute_policy.pdf", "page_label": "4"}
Key Challenge: Noisy Input
Raw documents are messy. PDFs often contain:
- Headers and footers that repeat on every page ("CONFIDENTIAL | PAGE 4")
- Tables that extract as garbled text
- Scanned images that contain no machine-readable text
Solutions:
- Use OCR (Optical Character Recognition) tools like Tesseract for scanned documents
- Use Unstructured.io which handles layout-aware extraction for complex PDFs
- Pre-process to strip boilerplate (page numbers, legal disclaimers that appear on every page)
3.2 Chunking Strategies
After loading, you need to split documents into smaller pieces called chunks. You do this because:
- An LLM's context window is limited (you can only feed it so much text)
- Embedding models (covered next) work best on focused, short passages
- Retrieval precision improves when chunks are topically focused
Think of chunking like cutting a textbook into flash cards. Each flash card covers one concept clearly.
Strategy 1: Fixed-Size Chunking
Split every N characters or tokens, regardless of content.
Input: "The customer must submit Form A. Then the team reviews it. Approval takes 5 days."
Chunk 1 (100 chars): "The customer must submit Form A. Then the team reviews"
Chunk 2 (100 chars): " it. Approval takes 5 days."
Problem: Chunks can split mid-sentence, losing meaning. Chunk 2 starts with "it" but "it" refers to something in Chunk 1.
Fix: Use overlap. Repeat 10-20% of the previous chunk at the start of the next.
Chunk 1: "The customer must submit Form A. Then the team reviews it."
Chunk 2: "Then the team reviews it. Approval takes 5 days." ← overlaps
This ensures context is not lost across boundaries.
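The fixed-size-with-overlap idea can be sketched in a few lines (a minimal character-based illustration; production splitters usually work on tokens rather than characters):

```python
def chunk_with_overlap(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks, repeating `overlap` characters
    from the end of each chunk at the start of the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks

text = "The customer must submit Form A. Then the team reviews it. Approval takes 5 days."
for c in chunk_with_overlap(text, chunk_size=60, overlap=15):
    print(repr(c))
```

Because each chunk repeats the tail of the previous one, a sentence cut at a boundary still appears whole in at least one chunk.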
Strategy 2: Recursive Character Splitting
Split by natural boundaries in this priority order: paragraph breaks, then sentence breaks, then word breaks. This is the most common default approach.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # max characters per chunk
    chunk_overlap=50,   # overlap between chunks
    separators=["\n\n", "\n", ". ", " "]  # tries these in order
)
chunks = splitter.split_text(document_text)
The splitter first tries to break on double newlines (paragraphs). If a paragraph is still too big, it tries single newlines, then sentence boundaries, and so on.
Strategy 3: Semantic Chunking
Instead of splitting at fixed sizes, split at semantic boundaries -- where the topic actually changes.
How it works:
- Split text into sentences
- Embed each sentence (convert to a vector, explained in section 3.3)
- Compute similarity between adjacent sentences
- When similarity drops sharply, it indicates a topic change -- cut there
Sentence 1: "To dispute a charge, call 1800-XXX." ← topic: disputes
Sentence 2: "You can also use our mobile app." ← topic: disputes
Sentence 3: "Our savings accounts offer 3.5% interest." ← TOPIC CHANGE
Sentence 4: "Rates are reviewed quarterly." ← topic: savings
Semantic chunking would produce one chunk for sentences 1-2 and another for 3-4. This is far more natural than cutting at character 500.
Tool: LangChain's SemanticChunker or LlamaIndex's SemanticSplitterNodeParser.
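The steps above can be sketched directly. Here `embed` is assumed to be any callable returning one vector per sentence (any model from section 3.3 works), and the 0.5 threshold is purely illustrative; real implementations often use a percentile of the observed similarity drops instead:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.5) -> list[list[str]]:
    """Group sentences, starting a new chunk whenever the similarity
    between adjacent sentence embeddings drops below the threshold."""
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # sharp similarity drop = topic change: cut here
        chunks[-1].append(sentences[i])
    return chunks
```

With real embeddings, the dispute sentences in the example above would land in one chunk and the savings sentences in another.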
Strategy 4: Document Structure-Aware Chunking
Use the document's own structure (headings, sections, paragraphs) as boundaries.
# Section: Dispute Process <- Heading 1
## Step 1: Submit a request <- Heading 2
Fill out Form A online... <- content chunk under Heading 2
## Step 2: Wait for review <- Heading 2
The team will contact you... <- content chunk under Heading 2
Each chunk retains which section it came from. This is powerful because you can later filter by section.
Strategy 5: Agentic / Proposition-Based Chunking
This is a newer, more sophisticated approach. Use an LLM itself to reformulate each paragraph into a set of standalone, self-contained propositions.
Before:
"It can be submitted by the customer or their representative."
After (LLM-generated propositions):
"A dispute form can be submitted by the customer."
"A dispute form can be submitted by the customer's representative."
Each proposition is a complete, self-contained fact. This dramatically improves retrieval because there is no ambiguity about what "it" refers to.
Choosing Chunk Size: Rules of Thumb
| Use Case | Recommended Chunk Size |
|---|---|
| Factual Q&A (precise answers needed) | 128-256 tokens |
| General document Q&A | 256-512 tokens |
| Summarisation tasks | 512-1024 tokens |
| Code retrieval | Whole function or class |
Always test empirically -- wrong chunk size is one of the top reasons RAG systems fail.
3.3 Embedding: Turning Text into Numbers
This is the most important concept in modern AI retrieval. To search for text by meaning (not just keywords), you need to convert text into vectors -- lists of numbers that capture semantic meaning.
What is an Embedding?
An embedding is a fixed-length list of decimal numbers that represents the "meaning" of a piece of text in a high-dimensional mathematical space.
"How do I dispute a charge?"
→ [0.23, -0.87, 0.45, 0.12, -0.33, ..., 0.67] (e.g. 1536 numbers)
"Steps to challenge a transaction"
→ [0.21, -0.85, 0.44, 0.14, -0.31, ..., 0.65] (very similar numbers!)
"What is the weather today?"
→ [-0.55, 0.12, -0.78, 0.93, 0.44, ..., -0.22] (very different numbers)
Texts with similar meanings produce vectors that are close together in this high-dimensional space. Texts with different meanings produce vectors that are far apart.
This is called the semantic similarity property of embeddings.
How Similarity is Measured
The most common measure is cosine similarity. It measures the angle between two vectors. If the angle is small (vectors point in the same direction), the texts are semantically similar.
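As a concrete sketch (plain Python, no libraries assumed), cosine similarity is the dot product of two vectors divided by the product of their lengths:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Truncated 3-number versions of the example vectors from section 3.3:
print(cosine_similarity([0.23, -0.87, 0.45], [0.21, -0.85, 0.44]))   # close to 1.0
print(cosine_similarity([0.23, -0.87, 0.45], [-0.55, 0.12, -0.78]))  # much lower
```

Vector databases implement this (or an equivalent distance) internally with heavy optimisation, but the definition is exactly this simple.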
Cosine similarity of 1.0 = identical meaning
Cosine similarity of 0.9 = very similar
Cosine similarity of 0.5 = somewhat related
Cosine similarity of 0.0 = completely unrelated
Embedding Models
Different models produce embeddings of different quality and dimensions. Here are the most commonly used ones in 2024-2025:
| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | Best general quality |
| text-embedding-3-small | OpenAI | 1536 | Fast, cheap, very good |
| embed-english-v3.0 | Cohere | 1024 | Strong, good for enterprise |
| voyage-3 | Voyage AI | 1024 | State of the art for RAG |
| nomic-embed-text | Nomic (open source) | 768 | Free, runs locally |
| bge-large-en-v1.5 | BAAI (open source) | 1024 | Strong open source option |
| mxbai-embed-large | MixedBread (open source) | 1024 | Excellent, can run locally |
Example: Embedding Chunks
from openai import OpenAI
client = OpenAI()
def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # list of 1536 floats
chunk = "To dispute a foreign transaction, the customer must call within 60 days."
vector = embed(chunk)
# vector = [0.23, -0.12, ..., 0.67] (1536 numbers)
This vector is what gets stored in a vector database alongside the original text.
Important: Embedding Symmetry
You must embed both documents AND queries using the same model. If you embed documents with model A and queries with model B, similarity scores will be meaningless.
Some models are asymmetric -- they use a different prefix for queries vs documents. For example, with bge-large-en-v1.5:
# Documents are embedded without a prefix
doc_vector = embed("dispute process requires Form A")
# Queries use a special prefix
query_vector = embed("Represent this sentence for searching relevant passages: How do I dispute a charge?")
Always check the model documentation to know if asymmetric embedding is required.
3.4 Metadata Enrichment
When you store a chunk in a vector database, you should always store metadata alongside the vector. Metadata is structured information about the chunk that enables filtering.
Why Metadata Matters
Without metadata, you search all chunks equally. With metadata, you can narrow down:
"Only search in Policy documents from the Retail Banking division, published after 2023."
Types of Metadata to Attach
Source metadata (automatically extractable):
{
"source_file": "retail_banking_disputes_policy_v3.pdf",
"page_number": 4,
"file_type": "pdf",
"created_date": "2024-01-15",
"last_modified": "2024-08-01"
}
Structural metadata (from document structure):
{
"section_heading": "Foreign Transaction Disputes",
"parent_heading": "Dispute Resolution",
"document_type": "policy",
"chapter": 3
}
Semantic metadata (LLM-generated at ingestion time):
{
"summary": "This chunk explains the 60-day window for disputing foreign charges.",
"keywords": ["dispute", "foreign transaction", "60 days", "Form A"],
"entities": ["Form A", "Dispute Resolution Team"],
"topic": "dispute process",
"audience": "retail banking customers"
}
Using an LLM to generate summaries and keywords at ingestion time is expensive but significantly improves retrieval quality.
Hypothetical Questions Metadata
A powerful technique: for each chunk, use an LLM to generate 3-5 hypothetical questions that the chunk would answer. Store these as metadata.
Chunk: "To dispute a foreign transaction, the customer must submit Form A within 60 days."
Generated questions:
- "What is the deadline to dispute a foreign charge?"
- "Which form do I use to challenge an overseas transaction?"
- "How long do I have to report an incorrect foreign payment?"
When a user asks one of these questions, you can match against both the chunk text AND the pre-generated questions, dramatically improving recall.
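As a sketch, the enrichment step might look like this; `generate_questions` is a hypothetical callable wrapping whatever chat LLM you use (not a specific library function):

```python
def enrich_with_questions(chunk_text: str, generate_questions) -> dict:
    """Attach LLM-generated hypothetical questions to a chunk.
    `generate_questions` is any callable returning a list of question strings."""
    questions = generate_questions(chunk_text)
    return {
        "text": chunk_text,
        "questions": questions,
        # Embedding this combined string lets a query match either the chunk
        # itself or any of its pre-generated questions.
        "embeddable_text": chunk_text + "\n" + "\n".join(questions),
    }
```

At query time nothing changes: the same vector search now has extra "question-shaped" surface area to match against.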
3.5 Multimodal Ingestion (Images, Audio, Tables)
Modern agentic AI is not limited to text. Here is how non-text content is ingested.
Tables
Tables in PDFs are notoriously hard to extract. Options:
Option 1: Convert to text description
Use an LLM or a specialised tool to describe the table in natural language.
Original table:
| Product | Rate | Min Balance |
| Savings | 3.5% | $500 |
| Premium | 4.2% | $10,000 |
Converted to text:
"The Savings account offers a 3.5% rate with a minimum balance of $500.
The Premium account offers 4.2% with a minimum balance of $10,000."
Option 2: Store as structured JSON and index both the JSON and a text description
Option 3: Use table-aware extraction tools like Unstructured.io or Camelot (for PDFs).
Images
Option 1: Vision LLM captioning
Pass each image through a multimodal LLM (like GPT-4o or Claude) and generate a text caption/description. Store the description as a searchable text chunk.
# At ingestion time, for each image in a document:
caption = vision_llm.describe(image)
# "A bar chart showing monthly transaction disputes from Jan-Dec 2023,
# with a peak of 1,245 disputes in August."
# Store caption as an embeddable chunk alongside source metadata
Option 2: Multimodal Embeddings
Models like CLIP or OpenAI's multimodal embeddings can embed both text and images in the same vector space. This allows you to search images with text queries directly.
Audio
- Transcribe audio to text using Whisper (OpenAI's open-source model) or AWS Transcribe
- Add speaker labels and timestamps as metadata
- Chunk the transcript by speaker turn or time window
- Embed and index like regular text
3.6 Knowledge Graph Ingestion
Rather than storing text chunks in isolation, a Knowledge Graph stores entities and the relationships between them.
Example
From the text "Alice manages the Dispute Resolution Team, which reports to the Operations Division":
A knowledge graph would extract:
- Entity: Alice (type: Person)
- Entity: Dispute Resolution Team (type: Team)
- Entity: Operations Division (type: Division)
- Relationship: Alice MANAGES Dispute Resolution Team
- Relationship: Dispute Resolution Team REPORTS_TO Operations Division
This is stored as a graph (nodes and edges) rather than text chunks.
Why This Matters for Agentic AI
Graphs excel at multi-hop queries -- questions that require connecting information across multiple documents.
"Who is ultimately responsible for approving foreign transaction disputes?"
To answer this, an agent might need to traverse:
- Alice manages the Dispute Resolution Team
- The team reports to the Operations Division
- The Operations Division head is Bob
- Bob's approval is required for disputes over $10,000
A flat text-chunk system would struggle. A graph makes this traversal explicit and efficient.
Tools for Knowledge Graph Ingestion
- LlamaIndex PropertyGraphIndex: Extracts entities and relations from text using an LLM and stores in a graph
- Microsoft GraphRAG: Builds community-level summaries and graphs from large corpora
- Neo4j + LangChain: Store and query property graphs
- NebulaGraph: Distributed graph database for large-scale deployments
3.7 Ingestion Pipelines and Orchestration
In production, ingestion is not a one-time batch job. Documents are added, updated, and deleted continuously. You need a pipeline.
Pipeline Stages
[Source Documents]
|
v
[Document Loader] <- Read raw files (PDF, DOCX, web, API)
|
v
[Pre-processing] <- Clean noise, extract structure, OCR if needed
|
v
[Chunking] <- Split into appropriately-sized pieces
|
v
[Metadata Enrichment] <- Add source info, LLM-generated summaries/questions
|
v
[Embedding] <- Convert each chunk to a vector
|
v
[Vector Store] <- Store vectors + metadata (Pinecone, Weaviate, etc.)
|
v
[Ready for Retrieval] <- Queryable index
Change Detection and Incremental Ingestion
You do not want to re-ingest 10,000 documents every time one file changes. Solutions:
- File hashing: Store an MD5 or SHA hash of each document. Only re-ingest if the hash changes.
- Timestamp tracking: Only re-ingest files modified since the last pipeline run.
- Soft deletes: When a document is deleted, mark its chunks as inactive in the vector store rather than hard-deleting (for audit trails).
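File hashing can be sketched with the standard library (SHA-256 shown here; the principle is identical with MD5):

```python
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA-256 of a file's bytes, read in blocks to handle large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

def changed_files(paths: list[Path], state_file: Path) -> list[Path]:
    """Return only the files whose hash differs from the last recorded run."""
    seen = json.loads(state_file.read_text()) if state_file.exists() else {}
    changed = [p for p in paths if seen.get(str(p)) != file_hash(p)]
    # Record current hashes for the next run
    state_file.write_text(json.dumps({str(p): file_hash(p) for p in paths}))
    return changed
```

Only the files returned by `changed_files` need to go back through the loading, chunking, and embedding stages.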
Popular Ingestion Frameworks
| Framework | Description |
|---|---|
| LangChain | General-purpose, massive ecosystem of loaders and splitters |
| LlamaIndex | Optimised specifically for RAG and indexing |
| Unstructured.io | Best-in-class for complex document parsing |
| Haystack | Production-ready ML pipelines |
| Apache Airflow | General workflow orchestration (pairs well with the above) |
Part 2: Data Retrieval Techniques
Retrieval is what happens at query time, when a user asks a question. The goal: find the most relevant chunks from your indexed store to give the LLM the right context.
4.1 Dense Retrieval (Vector Search)
Dense retrieval is the foundation of modern RAG. It uses the same embedding approach from ingestion.
How It Works
- User asks: "What is the deadline to dispute a foreign transaction?"
- Convert the query to a vector using the same embedding model used at ingestion
- Search the vector database for the chunks whose vectors are closest to the query vector
- Return the top-K most similar chunks (typically K=3 to K=10)
query = "What is the deadline to dispute a foreign transaction?"
query_vector = embed(query) # [0.21, -0.85, 0.44, ...]
# Search vector database
results = vector_db.query(
    vector=query_vector,
    top_k=5,  # Return top 5 results
    include_metadata=True
)
# results contains the 5 most semantically similar chunks
for result in results:
    print(result.score)     # e.g. 0.92 (cosine similarity)
    print(result.text)      # "To dispute a foreign transaction, submit Form A within 60 days..."
    print(result.metadata)  # {"source": "policy.pdf", "page": 4}
Strengths
- Finds semantic matches even when exact words differ ("deadline" matches "time limit")
- Works across paraphrases, synonyms, different phrasings
- Handles multilingual queries well with multilingual embedding models
Weaknesses
- Can miss exact keyword matches (searching "Form A" might not return the exact form reference)
- Performance degrades for highly technical, domain-specific terminology not well-represented in training data
- Requires an embedding model and vector store infrastructure
4.2 Sparse Retrieval (Keyword Search / BM25)
Sparse retrieval is the classic approach used by search engines for decades. It is based on matching keywords, not semantic meaning.
BM25 (Best Match 25)
BM25 is the industry standard sparse retrieval algorithm. It ranks documents based on:
- Term frequency: How often does the query word appear in the document?
- Inverse document frequency: Is the word rare (high value) or common (low value)?
- Document length normalisation: Longer documents should not be unfairly favoured
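These three factors can be sketched as a simplified BM25 scorer (one common variant of the formula; real engines add proper tokenisation, stemming, and inverted indexes):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a simplified BM25."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    query_terms = query.lower().split()
    # Inverse document frequency: rare terms are worth more than common ones
    df = {t: sum(1 for doc in tokenized if t in doc) for t in query_terms}
    idf = {t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) for t in query_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            # Term frequency, saturated by k1 and length-normalised by b
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf[t] * num / den
        scores.append(score)
    return scores
```

A chunk containing none of the query terms scores zero, which is exactly the sharp keyword behaviour (and the synonym blindness) described below.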
Query: "Form A dispute process"
BM25 scores each chunk based on how many query words appear,
how rare those words are across all chunks, and chunk length.
Chunk 1: "Submit Form A within 60 days..." → score: 8.4 (high -- "Form A" is specific)
Chunk 2: "The dispute process begins with..." → score: 6.2 (matches "dispute process")
Chunk 3: "Contact us for any banking needs..." → score: 0.1 (no keywords match)
Strengths
- Excellent for exact keyword matches (product codes, names, form numbers)
- Fast, no GPU required
- Interpretable (you know exactly why a result was returned)
- Works well for domain-specific jargon
Weaknesses
- Cannot handle synonyms or paraphrases (will miss "deadline" if the doc says "time limit")
- Sensitive to typos
- Requires users to use the "right" vocabulary
Tools
- OpenSearch / Elasticsearch (industrial-scale BM25 with many enhancements)
- Apache Lucene (underlying engine behind Elasticsearch)
- BM25s (fast Python implementation)
- Weaviate, Qdrant, and most vector databases now include BM25 as a built-in option
4.3 Hybrid Retrieval
Since dense (semantic) and sparse (keyword) retrieval each have complementary strengths, modern systems combine both. This is called hybrid retrieval and it consistently outperforms either approach alone.
How It Works
- Run the query through both a vector search and a BM25 search in parallel
- Get two separate ranked result lists
- Fuse (combine) the two lists into a single ranked list
- Return the top results
Dense results (semantic):
1. "The time limit for challenging international charges is 60 days." (score: 0.91)
2. "Customers may contest foreign payments by filling in Form A." (score: 0.88)
3. "Dispute windows close two months after the transaction date." (score: 0.85)
Sparse results (BM25):
1. "Submit Form A within 60 days to dispute foreign transactions." (score: 9.2)
2. "Form A is required for all foreign dispute submissions." (score: 7.8)
3. "See the dispute policy for Form A submission guidelines." (score: 6.1)
After fusion:
1. "Submit Form A within 60 days..." (appeared in BOTH lists -- highest confidence)
2. "Customers may contest foreign payments..." (top semantic match)
3. "Form A is required for all foreign dispute submissions." (strong keyword match)Reciprocal Rank Fusion (RRF)
RRF is the most common fusion algorithm. For each result in each list, it assigns a score based on rank position, then sums scores across lists.
RRF score for a document d = sum over each list of: 1 / (k + rank(d))
where k is a constant (typically 60)
A document ranked #1 in both lists scores much higher than one ranked #5 in one list and absent in another. This simple formula is robust and works well in practice.
# RRF over two ranked lists of document ids
def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    scores = {}
    for rank, doc_id in enumerate(dense_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    for rank, doc_id in enumerate(sparse_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Tools Supporting Hybrid Retrieval
- Weaviate: Native hybrid search with RRF
- Qdrant: Built-in sparse + dense fusion
- Elasticsearch: Reciprocal rank fusion support
- LangChain EnsembleRetriever: Combines any two retrievers with RRF
- Pinecone: Hybrid search with sparse-dense index
4.4 Reranking
After retrieval, you have a shortlist of candidate chunks (e.g. the top 20). Reranking is a second, more expensive scoring step that re-orders them to pick the best 5.
Why Not Just Use the Top-K from Retrieval?
First-stage retrieval (vector search or BM25) is fast but approximate. It finds "probably relevant" documents. A reranker does a deeper, more accurate relevance assessment.
Analogy: think of retrieval as a recruiter shortlisting 20 CVs, and reranking as the hiring manager picking the top 5 for interview. The recruiter is fast but may miss nuance. The manager takes more time but makes better decisions.
Cross-Encoder Rerankers
Cross-encoders take the query and each candidate document together as input and output a single relevance score. Unlike embeddings (which encode query and document separately), cross-encoders see both simultaneously, enabling richer comparison.
Input to cross-encoder:
[CLS] What is the deadline to dispute a foreign transaction? [SEP]
Submit Form A within 60 days to dispute foreign transactions. [SEP]
Output: 0.97 (very relevant)
Input:
[CLS] What is the deadline to dispute a foreign transaction? [SEP]
Contact us for any banking needs. [SEP]
Output: 0.04 (not relevant)
Popular Reranking Models
| Model | Provider | Notes |
|---|---|---|
| rerank-english-v3.0 | Cohere | Excellent, API-based |
| rerank-3.5 | Cohere | Latest, best quality |
| bge-reranker-large | BAAI | Open source, runs locally |
| ms-marco-MiniLM | Microsoft | Fast, good quality |
| Jina Reranker v2 | Jina AI | Multilingual support |
| rankllm | Various | Uses an LLM itself as a reranker |
Typical Retrieval + Reranking Pipeline
# Step 1: Fast retrieval -- get top 20 candidates
candidates = vector_db.query(query_vector, top_k=20)
# Step 2: Rerank -- score each candidate against the query
import cohere
co = cohere.Client(api_key)
reranked = co.rerank(
    query="What is the deadline to dispute a foreign transaction?",
    documents=[c.text for c in candidates],
    model="rerank-english-v3.0",
    top_n=5  # Keep only top 5
)
# reranked.results contains the 5 most relevant chunks
The typical pattern is: retrieve 20-50 with fast vector search, then rerank to top 5-10. This balances speed and accuracy.
4.5 Query Transformation Techniques
The user's original query is often not the best query for retrieval. Query transformation improves retrieval by modifying or expanding the query before searching.
Technique 1: Query Rewriting
Use an LLM to rewrite the query into a more explicit form that retrieval can handle better.
Original query: "What happens next after I send the form?"
Rewritten query: "What is the process after submitting Form A for a foreign transaction dispute?
What steps follow the submission and what is the expected timeline?"
The rewritten query is more specific, uses more keywords, and is more likely to match relevant chunks.
Technique 2: Step-Back Prompting
When a query is very specific, sometimes you need to retrieve context at a higher level of abstraction first.
Specific query: "What happens if I dispute a charge made on 3rd March but the 60-day deadline was 2nd March?"
Step-back query: "What are the exceptions and grace periods in the foreign transaction dispute policy?"
The step-back query retrieves the broader policy context, including any exceptions or grace periods, which then informs the answer to the specific question.
Technique 3: HyDE (Hypothetical Document Embeddings)
A clever approach: use an LLM to generate a hypothetical answer to the query, then embed and search using that hypothetical answer instead of the original query.
Query: "What is the dispute deadline for foreign transactions?"
Step 1 -- LLM generates a hypothetical answer:
"The dispute deadline for foreign transactions is typically 60 days from the transaction date.
Customers must submit Form A with supporting documentation..."
Step 2 -- Embed the hypothetical answer (not the original query)
Step 3 -- Search for chunks similar to this hypothetical answer
Why does this work? The hypothetical answer is in "document language" (the kind of text that would appear in the policy), so it aligns better with how the real answer is written in the policy document.
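The three steps can be sketched generically; `generate`, `embed`, and `vector_search` stand in for whatever LLM, embedding model, and vector store you already use (hypothetical interfaces, not a specific library):

```python
def hyde_search(query: str, generate, embed, vector_search, top_k: int = 5):
    """HyDE: retrieve with the embedding of a *hypothetical answer*
    rather than the embedding of the query itself."""
    # Step 1: LLM drafts a plausible answer (it may be wrong; that's fine)
    hypothetical_answer = generate(
        f"Write a short passage that plausibly answers: {query}"
    )
    # Steps 2-3: embed the draft and search with that vector
    return vector_search(embed(hypothetical_answer), top_k=top_k)
```

The drafted answer never reaches the user; it exists only to steer the vector search toward document-shaped text.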
HyDE is covered in more depth in section 5.3.
Technique 4: Multi-Query Generation
Generate multiple variations of the original query and retrieve for each. Merge and deduplicate results.
Original query: "dispute foreign transaction"
Generated variants:
1. "How do I challenge an international charge on my account?"
2. "What is the process for contesting a foreign payment?"
3. "Steps to dispute an overseas transaction"
4. "Foreign transaction dispute form and deadline"
Retrieve for all 4 queries, merge results, deduplicate
This is sometimes called query expansion and ensures that wording variations do not cause you to miss relevant documents.
# LangChain MultiQueryRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm
)
results = retriever.invoke("dispute foreign transaction")
Technique 5: Decomposition (Sub-Question Generation)
For complex, multi-part queries, decompose into simpler sub-questions, retrieve for each, then synthesise.
Complex query: "Which division handles disputes, how long do they take, and who approves them?"
Decomposed:
1. "Which division handles foreign transaction disputes?"
2. "How long does the dispute resolution process take?"
3. "Who has approval authority for disputes?"
Retrieve for each sub-question separately.
Combine the 3 sets of retrieved chunks.
LLM synthesises a unified answer from all retrieved context.
4.6 Contextual Compression and Filtering
Even after retrieval, chunks may contain irrelevant sentences mixed with relevant ones. Contextual compression extracts only the relevant parts.
How It Works
Retrieved chunk (full text):
"Our bank was founded in 1924. We offer personal and business banking solutions.
To dispute a foreign transaction, submit Form A within 60 days.
We also offer home loans and investment products.
Our customer service team is available 24/7."
After contextual compression (only relevant to the dispute query):
"To dispute a foreign transaction, submit Form A within 60 days."
The compressed chunk is smaller, more focused, and uses less of the LLM's context window.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever()
)
Metadata Filtering
Before or during retrieval, apply hard filters on metadata to narrow the search space.
# Only search in documents from the "Retail Banking" division
# AND published after 2023-01-01
results = vector_db.query(
    vector=query_vector,
    top_k=5,
    filter={
        "division": "Retail Banking",
        "published_after": "2023-01-01",
        "document_status": "active"
    }
)
This is equivalent to SQL's WHERE clause applied to vector search.
4.7 Agentic / Multi-Step Retrieval
This is where retrieval becomes truly agentic. Instead of a single retrieve-then-generate step, an agent iteratively retrieves, reasons, and decides whether to retrieve more.
Iterative Retrieval (ReAct Pattern)
The agent follows a Reason-Act-Observe loop:
User: "Who approves disputes over $10,000 and how should I contact them?"
Agent loop:
THOUGHT 1: "I need to find who approves disputes over $10,000."
ACTION 1: Search("approval authority for disputes over $10,000")
OBSERVATION 1: "Disputes over $10,000 require sign-off from the Operations Director."
THOUGHT 2: "Now I need contact details for the Operations Director."
ACTION 2: Search("Operations Director contact details")
OBSERVATION 2: "Operations Director: Sarah Chen, sarah.chen@bank.com, ext. 4521"
THOUGHT 3: "I have enough information to answer fully."
FINAL ANSWER: "Disputes over $10,000 are approved by the Operations Director,
Sarah Chen. She can be reached at sarah.chen@bank.com or extension 4521."

Each retrieval is informed by the result of the previous one. This is fundamentally different from a single-step RAG system.
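The loop above can be sketched as a minimal controller. This is an illustrative toy, not a framework API: `llm_step` stands in for a real LLM call, and `search` for a real retrieval tool.

```python
# Minimal ReAct-style loop: the model proposes an action, we execute it,
# and feed the observation back until it emits a final answer.

def search(query: str) -> str:
    # Stand-in retrieval tool with two canned results.
    kb = {
        "approval authority": "Disputes over $10,000 require sign-off from the Operations Director.",
        "contact details": "Operations Director: Sarah Chen, sarah.chen@bank.com, ext. 4521",
    }
    for key, text in kb.items():
        if key in query:
            return text
    return "No results."

def react_loop(question: str, llm_step, max_steps: int = 5) -> str:
    transcript = f"QUESTION: {question}"
    for _ in range(max_steps):
        step = llm_step(transcript)           # returns {"action": ...} or {"answer": ...}
        if "answer" in step:
            return step["answer"]
        observation = search(step["action"])  # ACT, then OBSERVE
        transcript += f"\nACTION: {step['action']}\nOBSERVATION: {observation}"
    return "Gave up after max_steps."
```

In practice `llm_step` would prompt the model with the running transcript and parse its THOUGHT/ACTION output; frameworks like LangGraph manage this loop for you.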
Self-Querying Retrieval
The agent automatically constructs metadata filters from natural language.
User query: "What disputes policies were updated last year in the retail division?"
Self-querying extracts:
- Semantic search query: "disputes policies"
- Metadata filters:
division = "retail"
updated_date >= "2024-01-01" AND updated_date <= "2024-12-31"The LLM parses the natural language query into both a semantic component and structured filters, then executes a filtered vector search.
Parent-Child Retrieval
A two-level chunking strategy that improves context quality:
- Child chunks: Small, precise chunks (128 tokens) used for embedding and retrieval. Small size = high precision in matching.
- Parent chunks: Larger chunks (512 tokens) returned to the LLM after a child chunk is matched. Larger size = more context for the LLM to reason over.
Parent chunk (what LLM sees):
"[Section 4: Foreign Transactions]
All foreign transaction disputes must be initiated within 60 days of the transaction date.
The customer must complete Form A, available at any branch or online at bank.com/forms.
Supporting documentation such as bank statements should be attached.
Disputes initiated after 60 days will be declined except in cases of fraud or system error..."
Child chunk (what was matched by vector search):
"The customer must complete Form A, available at any branch or online at bank.com/forms."The child chunk matched the query precisely. But the LLM receives the full parent section as context, giving it the surrounding information needed to reason correctly.
4.8 Structured Data Retrieval (Text-to-SQL)
Not all data is in documents. Often the most important data lives in databases. Agentic AI can query structured databases using natural language through Text-to-SQL.
How Text-to-SQL Works
User: "How many disputes were filed in March 2024 and what was the average resolution time?"
Step 1 -- Agent retrieves database schema:
Tables: disputes(id, customer_id, date_filed, date_resolved, amount, status, type)
customers(id, name, account_type, region)
Step 2 -- LLM generates SQL:
SELECT
COUNT(*) as total_disputes,
AVG(DATEDIFF(date_resolved, date_filed)) as avg_resolution_days
FROM disputes
WHERE date_filed >= '2024-03-01' AND date_filed < '2024-04-01'
AND date_resolved IS NOT NULL;
Step 3 -- Execute SQL, return result:
{"total_disputes": 347, "avg_resolution_days": 8.2}
Step 4 -- LLM formats natural language answer:
"In March 2024, there were 347 disputes filed. The average resolution time was 8.2 days."Making Text-to-SQL Reliable
Text-to-SQL fails silently -- wrong SQL returns wrong numbers with no error. Mitigation strategies:
- Schema enrichment: Add rich descriptions to each column (not just `amount`, but `amount: The disputed transaction amount in AUD, always positive`)
- Few-shot examples: Include 5-10 example query-SQL pairs in the prompt to guide the LLM
- SQL validation: Run EXPLAIN before EXECUTE to catch syntax errors
- Row-limit safety: Always append LIMIT 100 to prevent full table scans
- Read-only connections: Never give the agent a connection with INSERT/UPDATE/DELETE rights
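Several of these mitigations can be wrapped into one guarded execution function. A sketch using SQLite; the `run_llm_sql` name and the exact checks are illustrative, not a library API:

```python
import sqlite3

# Guard-railed execution of LLM-generated SQL: read-only intent check,
# EXPLAIN validation, and a forced row limit.
def run_llm_sql(conn: sqlite3.Connection, sql: str, max_rows: int = 100):
    statement = sql.strip().rstrip(";")
    if not statement.lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    if "limit" not in statement.lower():
        statement += f" LIMIT {max_rows}"   # row-limit safety
    conn.execute(f"EXPLAIN {statement}")     # catches syntax errors cheaply
    return conn.execute(statement).fetchall()
```

A real deployment would also use a database role with no write privileges, so the SELECT-only check is defence in depth rather than the only barrier.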
Tools
- LangChain SQLDatabaseChain: Simple Text-to-SQL integration
- LlamaIndex NLSQLTableQueryEngine: More sophisticated with schema awareness
- Vanna.ai: Trains a Text-to-SQL model on your specific database
- DSPy: Programmatic prompt optimisation for Text-to-SQL
4.9 Knowledge Graph Retrieval
For queries requiring multi-hop reasoning, knowledge graph retrieval traverses relationships between entities.
Graph Traversal Retrieval
Query: "Which team handles disputes and who manages them?"
Graph traversal:
1. Start at entity: "foreign transaction dispute" (matches query)
2. Traverse: dispute HANDLED_BY "Dispute Resolution Team"
3. Traverse: "Dispute Resolution Team" MANAGED_BY "Alice Johnson"
Answer: "Foreign transaction disputes are handled by the Dispute Resolution Team,
managed by Alice Johnson."This multi-hop traversal would be very difficult with flat text retrieval, as the relationship might not be stated explicitly in a single chunk.
Cypher Query Generation (Neo4j)
Similar to Text-to-SQL, an LLM can generate graph query language.
User: "Find all policies that Alice Johnson is responsible for."
LLM generates Cypher:
MATCH (p:Person {name: "Alice Johnson"})-[:RESPONSIBLE_FOR]->(doc:Document)
RETURN doc.title, doc.last_updated
Result: [
{"title": "Foreign Transaction Dispute Policy", "last_updated": "2024-01-15"},
{"title": "Chargeback Guidelines", "last_updated": "2023-11-20"}
]

Part 3: Advanced Techniques
5.1 RAG Variants
RAG has evolved from a simple "retrieve then generate" pattern into a family of increasingly sophisticated architectures.
Naive RAG (The Baseline)
Query → Embed → Vector Search → Top-K Chunks → LLM → Answer

Simple pipeline. It works for straightforward Q&A but has clear limitations:
- One retrieval step (no iteration)
- No query transformation
- No reranking
- Poor handling of complex, multi-part questions
Advanced RAG
Adds improvements at every stage:
- Pre-retrieval: query rewriting, query decomposition
- Retrieval: hybrid search (dense + sparse), metadata filtering
- Post-retrieval: reranking, contextual compression
Still a single-pass pipeline but much higher quality.
Modular RAG
Treats each component (retriever, reranker, generator, memory) as a swappable module. You compose a pipeline from modules based on the task.
For a simple FAQ: Retriever → Generator
For complex research: QueryDecomposer → MultiRetriever → Reranker → Synthesiser → Generator
Agentic RAG
The LLM actively controls the retrieval process. It decides:
- Whether to retrieve at all (maybe it already knows the answer)
- What to retrieve (formulates its own queries)
- Whether the retrieved information is sufficient (or whether to search again)
- Which tool to use (vector search, SQL, web search, knowledge graph)
This is the state of the art and is what production agentic systems look like.
5.2 RAPTOR and Hierarchical Indexing
RAPTOR (Recursive Abstractive Processing for Tree-Organised Retrieval) solves a key limitation: questions that require synthesising information spread across many documents cannot be answered by any single retrieved chunk.
How RAPTOR Works
- Chunk all documents normally (leaf nodes)
- Cluster similar chunks together using unsupervised clustering (e.g. UMAP + Gaussian Mixture Models)
- For each cluster, use an LLM to write a summary of the cluster
- Treat these summaries as new "higher-level" documents
- Cluster and summarise again (recursively)
- Build a tree from leaf chunks to high-level summaries
Level 3 (root): "Bank dispute policy summary: disputes require Form A, 60-day window,
approved by Operations Director for amounts > $10,000"
/ \
Level 2: "Foreign transaction "Domestic dispute
disputes section" procedure section"
/ \ |
Level 1: Chunk A Chunk B Chunk C
(60 days) (Form A) (domestic steps)

At query time, retrieve from all levels. Broad questions retrieve high-level summaries. Specific questions retrieve leaf chunks. This gives the best of both worlds.
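A heavily simplified sketch of the tree-building phase: real RAPTOR clusters chunks with UMAP + Gaussian mixtures and summarises each cluster with an LLM, whereas here adjacent chunks are paired and "summarised" by concatenation, purely to show the recursive structure.

```python
# RAPTOR-style tree building, radically simplified.

def summarise(texts: list[str]) -> str:
    return " / ".join(texts)          # stand-in for an LLM-written summary

def build_raptor_tree(chunks: list[str]) -> list[list[str]]:
    levels = [chunks]                  # levels[0] = leaf chunks
    while len(levels[-1]) > 1:
        current = levels[-1]
        # "Cluster" adjacent pairs and summarise each cluster.
        parents = [summarise(current[i:i + 2]) for i in range(0, len(current), 2)]
        levels.append(parents)
    return levels                      # levels[-1] = root summary
```

Retrieval would then search every level of the tree, so upper levels serve broad queries and leaves serve specific ones.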
5.3 HyDE (Hypothetical Document Embeddings)
A simple but powerful technique. The insight: a query like "dispute deadline?" is phrased very differently from the policy document that answers it. The semantic gap can cause misses.
The Technique
Step 1 -- Original query:
"What is the deadline to dispute a foreign transaction?"
Step 2 -- LLM generates a hypothetical answer:
"Foreign transaction disputes must be initiated within 60 days of the charge date.
The customer is required to complete Form A and provide supporting documentation."
Step 3 -- Embed the hypothetical answer (not the original query)
Step 4 -- Retrieve based on the hypothetical embedding

Why does this work? The hypothetical answer uses the same vocabulary and style as the real policy document. Its embedding is much closer to the real answer's embedding than the original short query would be.
The actual hypothetical answer does not need to be correct -- it just needs to be in the right "vector neighbourhood".
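A toy demonstration of the effect, using word overlap as a stand-in for embedding similarity (the texts are adapted from the example above):

```python
# HyDE intuition: the hypothetical answer shares vocabulary with the real
# document, so it scores closer to the document than the raw query does.

def embed(text: str) -> set[str]:
    return set(text.lower().replace(".", "").split())   # bag-of-words "embedding"

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)                       # Jaccard stand-in for cosine

query = "dispute deadline?"
hypothetical = ("Foreign transaction disputes must be initiated within 60 days "
                "of the charge date using Form A.")
document = ("All foreign transaction disputes must be initiated within 60 days "
            "of the transaction date. The customer must complete Form A.")

direct = similarity(embed(query), embed(document))
hyde = similarity(embed(hypothetical), embed(document))
# hyde scores well above direct: the hypothetical answer is a much better probe.
```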
5.4 Self-RAG and Corrective RAG (CRAG)
These techniques make the RAG system self-aware -- able to evaluate its own retrieval quality and correct errors.
Self-RAG
Trains a special LLM that generates reflection tokens alongside its output. These tokens express:
- [Retrieve]: Should I retrieve information at all?
- [IsRel]: Is this retrieved document relevant?
- [IsSup]: Is my generated statement supported by the retrieved document?
- [IsUse]: Is this response useful to the user?
This allows the model to skip retrieval when it already knows the answer, and to flag when its output is not grounded in retrieved evidence.
Corrective RAG (CRAG)
CRAG adds an evaluator that scores retrieved documents for relevance. If documents score poorly:
- Detection: the evaluator flags that retrieval quality is low
- Web search fallback: the system falls back to web search to find better information
- Knowledge refinement: irrelevant content is stripped from the retrieved documents before they are passed to the LLM
Retrieval → Relevance Evaluator
|
-------------------------
| |
High quality Low quality
| |
Proceed Web search fallback
|
Refine + Proceed

5.5 GraphRAG
GraphRAG, developed by Microsoft Research (2024), addresses the fundamental limitation of chunk-based RAG: it cannot answer questions about broad themes, global summaries, or patterns across an entire document corpus.
The Problem It Solves
Standard RAG cannot answer: "What are the top recurring themes in our dispute policy across all documents?"
No single chunk contains this answer. You would need to read everything.
How GraphRAG Works
Phase 1: Indexing (expensive, done once)
- Divide document corpus into chunks
- Use an LLM to extract entities (people, organisations, concepts) and their relationships from each chunk
- Build a knowledge graph
- Detect communities (clusters of related entities) using graph algorithms (Leiden algorithm)
- For each community, generate a community summary using an LLM
Phase 2: Query
There are two query modes:
- Global search: Summarise across all community summaries to answer high-level questions about the entire corpus
- Local search: Navigate the graph from relevant entities to find specific answers
Global query: "What are the main themes in our banking policies?"
→ LLM reads all community summaries
→ Synthesises themes across summaries
→ Answer: "The main themes are: dispute resolution (60+ docs),
customer verification (40+ docs), interest rate management (35+ docs)"
Local query: "Who handles foreign transaction disputes?"
→ Start from "foreign transaction" entity
→ Traverse graph to "Dispute Resolution Team" entity
→ Find related community summary
→ Answer: "The Dispute Resolution Team, managed by Alice Johnson..."GraphRAG vs Standard RAG
| Aspect | Standard RAG | GraphRAG |
|---|---|---|
| Best for | Specific factual Q&A | Broad thematic questions |
| Cost | Low | High (LLM used at index time) |
| Global questions | Poor | Excellent |
| Specific lookups | Excellent | Good |
| Index size | Small | Large |
5.6 Long-Context Strategies and Lost-in-the-Middle Problem
LLMs now support context windows of 128K-2M tokens. Should you just stuff everything in and skip retrieval?
The Lost-in-the-Middle Problem
Research (Liu et al., 2023) found that LLMs are significantly worse at using information placed in the middle of a long context versus at the beginning or end. In a 20-document context, LLMs use document 1 and document 20 well, but document 10 is often ignored.
Strategies to Combat This
Strategy 1: Retrieval before long context
Use retrieval to select the 5-10 most relevant documents, then pass only those to the LLM. You avoid stuffing irrelevant content that confuses the model.
Strategy 2: Reorder retrieved documents
Place the most relevant documents at the beginning and end of the context, not in the middle.
# After retrieval and reranking, reorder documents:
# Put most relevant at position 0 and position -1
# Put least relevant in the middle
def reorder_for_position_bias(docs: list) -> list:
    # docs arrive sorted most-relevant-first. Walk them from LEAST
    # relevant, alternating front and back, so the strongest documents
    # end up at the edges and the weakest in the middle.
    result = []
    for i, doc in enumerate(reversed(docs)):
        if i % 2 == 0:
            result.insert(0, doc)
        else:
            result.append(doc)
    return result

Strategy 3: Long-context + RAG hybrid
Use RAG to retrieve top-20 documents, but then pass all 20 to a long-context model. You get precision from retrieval but recover from any misses with the long context.
Part 4: Vector Databases and Storage Architecture
A vector database is purpose-built for storing embeddings and performing similarity search efficiently at scale.
Core Concepts
Approximate Nearest Neighbour (ANN) Search
Searching all N stored vectors for the closest match (brute force) is O(N) per query -- too slow for millions of vectors. Vector databases use Approximate Nearest Neighbour (ANN) algorithms that trade a small amount of accuracy for massive speed gains.
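For intuition, exact search is just a linear scan that scores the query against every stored vector; this per-query cost is exactly what ANN indexes are built to avoid:

```python
import math

# Exact (brute-force) nearest neighbour search: O(N) per query.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(vectors: dict[str, list[float]], query: list[float], k: int = 1):
    # Score every stored vector against the query, keep the top k.
    scored = sorted(vectors.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [name for name, _ in scored[:k]]
```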
Popular ANN algorithms:
| Algorithm | Description | Used In |
|---|---|---|
| HNSW (Hierarchical Navigable Small World) | Graph-based, very fast, high accuracy | Weaviate, Qdrant, pgvector |
| IVF (Inverted File Index) | Clusters vectors, searches only nearest clusters | FAISS, Pinecone |
| LSH (Locality Sensitive Hashing) | Hash-based, very fast, lower accuracy | Older systems |
| DiskANN | Disk-based, handles billion-scale vectors | Azure AI Search |
Major Vector Database Options
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Fast startup, production scale, no infrastructure |
| Weaviate | Open source / cloud | Rich features, hybrid search, GraphQL API |
| Qdrant | Open source / cloud | Performance, filtering, Rust-based |
| Chroma | Open source, local | Prototyping, local development |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Milvus | Open source | Large scale, enterprise on-prem |
| Redis Vector | Redis extension | Low latency, teams already using Redis |
| Azure AI Search | Managed cloud | Azure ecosystem, hybrid search |
| OpenSearch | Open source / managed | Teams using OpenSearch/Elasticsearch |
Choosing a Vector Database
- Prototyping / small scale: Chroma (local) or pgvector (if you have Postgres)
- Production, fully managed: Pinecone or Weaviate Cloud
- High performance, self-hosted: Qdrant
- Hybrid search built-in: Weaviate or Qdrant
- Azure ecosystem: Azure AI Search
Storage Architecture Patterns
Pattern 1: Separate Vector Store + Document Store
[Vector DB] [Document Store (S3, MongoDB, etc.)]
- vector - original text
- chunk_id → - full document
- metadata - metadataThe vector DB stores only vectors and chunk IDs. The full text lives separately. Cheaper, more flexible.
Pattern 2: All-in-One (Vector + Metadata + Text)
Modern vector databases like Weaviate and Qdrant store vectors, metadata, and text together. Simpler, but more expensive at scale.
Pattern 3: Layered Cache
L1: In-memory cache (Redis) -- most recent/frequent queries
L2: Vector database -- all indexed chunks
L3: Document store -- raw original documents

Frequently asked questions are answered from cache instantly. Rare questions go through full retrieval.
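A sketch of the L1 lookup logic, with a plain dict standing in for Redis and a stub `retrieve` standing in for the full L2/L3 pipeline:

```python
# Layered cache lookup: check L1 first, fall through to full retrieval.

cache: dict[str, str] = {}          # stand-in for Redis (L1)
calls = {"retrieval": 0}            # counts trips to the expensive path

def retrieve(query: str) -> str:
    calls["retrieval"] += 1          # expensive: vector DB + reranker + LLM
    return f"answer to: {query}"

def answer(query: str) -> str:
    key = " ".join(query.lower().split())   # normalise before cache lookup
    if key not in cache:
        cache[key] = retrieve(query)
    return cache[key]
```

Real deployments add a TTL and cache invalidation tied to the re-indexing pipeline, so stale answers are evicted when the underlying documents change.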
Part 5: End-to-End Architecture Example
Let us walk through a complete real-world example: a bank's internal AI assistant that answers questions from customer service agents.
Scenario
Documents: 5,000 internal policy PDFs, 10 years of compliance guidelines, product manuals, FAQ documents. Total: ~50 million tokens.
Users: 500 customer service agents asking questions like "What is the foreign transaction dispute deadline?" or "What are the eligibility criteria for a premium savings account?"
Ingestion Architecture
[Source Systems]
Confluence, SharePoint, S3 PDFs
|
v
[Document Loaders] (Unstructured.io for complex PDFs, custom loaders for Confluence)
|
v
[Pre-processing] (Remove headers/footers, normalise text, extract tables to JSON)
|
v
[Chunking] (Recursive splitting, 512 tokens, 50-token overlap)
|
v
[Metadata Enrichment] (LLM-generated summary, keywords, hypothetical questions per chunk)
|
v
[Embedding] (text-embedding-3-small, 1536 dimensions, batched at 500 chunks/min)
|
v
[Vector Store] (Weaviate, HNSW index, hybrid search enabled)
|
v
[Change Detection] (File hashing, daily incremental re-index pipeline via Airflow)

Query Architecture
[Agent receives query from customer service agent]
|
v
[Query Classification] (Simple Q&A? Complex multi-step? Needs SQL?)
|
v
[Query Transformation] (Rewrite, decompose if complex)
|
---------
| |
v v
[Vector [BM25 <- Hybrid retrieval in parallel
Search] Search]
| |
---------
|
v
[Fusion] (Reciprocal Rank Fusion, top 20 candidates)
|
v
[Reranking] (Cohere rerank-english-v3.0, top 5)
|
v
[Contextual Compression] (Extract most relevant sentences from each chunk)
|
v
[LLM Generation] (GPT-4o or Claude, with retrieved context)
|
v
[Grounding Check] (Is the answer supported by sources? Self-RAG evaluation)
|
v
[Response + Citations] (Answer with source document links)

Infrastructure Summary
| Component | Technology |
|---|---|
| Document ingestion | Unstructured.io + custom Python loaders |
| Chunking + pipeline | LlamaIndex |
| Embedding model | text-embedding-3-small (OpenAI) |
| Vector store | Weaviate (self-hosted on GCP) |
| Reranker | Cohere rerank-english-v3.0 |
| LLM | GPT-4o (Azure OpenAI) |
| Orchestration | LangGraph (agentic workflows) |
| Pipeline scheduling | Apache Airflow |
| Caching | Redis (query-level caching) |
| Observability | LangSmith (tracing every retrieval + generation step) |
Part 6: Evaluation and Quality Metrics
You cannot improve what you do not measure. Here are the key metrics for evaluating RAG and agentic retrieval systems.
Retrieval Metrics
Recall@K
"Of all the documents that should have been retrieved, how many were actually retrieved in the top K results?"
Example:
- Documents that should answer the query: [doc_A, doc_B]
- Retrieved top 5: [doc_C, doc_A, doc_D, doc_E, doc_B]
- Both doc_A and doc_B are in top 5
Recall@5 = 2/2 = 1.0 (perfect)

Precision@K
"Of the K documents retrieved, how many were actually relevant?"
Retrieved top 5: [doc_C, doc_A, doc_D, doc_E, doc_B]
Relevant docs in top 5: doc_A and doc_B → 2 out of 5
Precision@5 = 2/5 = 0.4

MRR (Mean Reciprocal Rank)
Measures how high the first relevant result appears. Higher is better.
First relevant result at rank 2 → MRR = 1/2 = 0.5
First relevant result at rank 1 → MRR = 1/1 = 1.0

End-to-End RAG Metrics (RAGAS Framework)
RAGAS is a widely used framework for evaluating RAG pipelines.
| Metric | What It Measures | How |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | LLM judge |
| Answer Relevancy | Does the answer address the question? | Embedding similarity |
| Context Precision | Are retrieved contexts relevant? | LLM judge |
| Context Recall | Were all necessary contexts retrieved? | LLM + ground truth |
| Answer Correctness | Is the answer factually correct? | LLM + ground truth |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
score = evaluate(
dataset=my_eval_dataset, # contains query, answer, contexts, ground_truth
metrics=[faithfulness, answer_relevancy, context_recall]
)
# Output:
# faithfulness: 0.92 (92% of claims supported by retrieved context)
# answer_relevancy: 0.88 (88% of answer directly addresses the query)
# context_recall: 0.85 (85% of necessary info was retrieved)

Observability and Tracing
In production, trace every request through the pipeline to debug failures.
Tools:
- LangSmith: Full tracing of LangChain/LangGraph pipelines
- Arize Phoenix: Open source, model performance monitoring
- Langfuse: Open source, supports multiple frameworks
- Weights and Biases: Experiment tracking + tracing
A trace records: the original query, the transformed queries, which chunks were retrieved (with scores), the reranked order, the full LLM prompt, and the final answer. When something goes wrong, you can inspect the exact retrieval step that caused the failure.
Glossary
| Term | Definition |
|---|---|
| Agent | An AI system that can take multi-step actions using tools to complete a goal |
| ANN (Approximate Nearest Neighbour) | Algorithm for fast approximate similarity search over vectors |
| BM25 | Classic keyword-based retrieval algorithm used in search engines |
| Chunk | A smaller piece of a document, created by splitting during ingestion |
| Chunking | The process of splitting documents into smaller, searchable pieces |
| Context window | The maximum amount of text an LLM can read at once |
| Cosine similarity | A measure of how similar two vectors are (1 = identical, 0 = unrelated) |
| Cross-encoder | A model that scores relevance by reading query and document together |
| Dense retrieval | Semantic search using vector embeddings |
| Embedding | A list of numbers that represents the meaning of text in vector space |
| GraphRAG | A RAG technique that uses knowledge graphs and community summaries |
| Hallucination | When an LLM generates confident but incorrect information |
| HNSW | An efficient graph-based algorithm for approximate nearest neighbour search |
| Hybrid retrieval | Combining dense (vector) and sparse (keyword) retrieval |
| HyDE | Technique that embeds a hypothetical answer to improve retrieval |
| Ingestion | The process of loading, processing, and indexing documents before retrieval |
| Knowledge graph | A graph structure storing entities and their relationships |
| LLM | Large Language Model (e.g. GPT-4, Claude, Gemini) |
| Metadata | Structured information about a chunk (source, date, author, section) |
| Multi-hop query | A query requiring traversal of multiple steps or documents to answer |
| Naive RAG | Basic retrieve-then-generate pipeline without optimisations |
| OCR | Optical Character Recognition -- converting images of text to machine-readable text |
| RAPTOR | Hierarchical indexing technique using recursive clustering and summarisation |
| RAG | Retrieval-Augmented Generation -- grounding LLM responses in retrieved documents |
| RAGAS | A framework for evaluating RAG pipeline quality |
| Reranking | A second-pass scoring step to improve the ordering of retrieved results |
| RRF (Reciprocal Rank Fusion) | Algorithm for combining ranked lists from multiple retrieval methods |
| Self-RAG | A technique where the LLM evaluates its own retrieval and generation quality |
| Sparse retrieval | Keyword-based retrieval (BM25) using inverted indices |
| Vector | A list of numbers representing a point in high-dimensional space |
| Vector database | A database purpose-built for storing and searching embedding vectors |
This document covers the state of the art as of mid-2025. The field moves quickly. Core concepts like chunking, embeddings, hybrid retrieval, and reranking are stable. Specific models and tools evolve rapidly.