
Agentic AI ingestion + retrieval — quick reference

Ingestion — loading & preparing data
Document loaders
Read raw files (PDF, DOCX, web, Confluence, SQL) and produce plain text + metadata. Every pipeline starts here. OCR handles scanned documents.
Chunking
Split documents into focused pieces. Five strategies: fixed-size (with overlap), recursive (paragraph → sentence → word), semantic (topic-change boundaries), structure-aware (headings), proposition-based (LLM rewrites to standalone facts).
Chunk overlap
Repeat 10–20% of the previous chunk at the start of the next. Prevents losing context when a sentence is split across a boundary.
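A minimal sketch of the fixed-size strategy from the Chunking entry, with the overlap described here, in plain Python (character-based for brevity; the sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` characters of the previous one (here ~15%)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Token-based splitters work the same way, with a tokeniser in place of character slicing.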
Embeddings
Convert text to a fixed-length vector of numbers. Similar meanings → similar vectors. Must use the same model for both documents and queries. Common: text-embedding-3-small, voyage-3, bge-large.
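A minimal sketch using the OpenAI Python SDK (assumes `openai` and `numpy` are installed and OPENAI_API_KEY is set; any embedding model works the same way):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # The same model must embed both documents and queries.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vec, query_vec = embed(["The deadline is Friday.", "What is the time limit?"])
similarity = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
print(similarity)  # higher cosine similarity = closer meaning
```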
Metadata enrichment
Attach structured fields to each chunk: source file, page, section heading, LLM-generated summary, keywords, entities, and hypothetical questions the chunk answers. Enables filtering at query time.
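A sketch of a chunk record carrying such fields (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str                  # originating file
    page: int                    # page number in the source
    section: str                 # nearest heading
    summary: str = ""            # LLM-generated one-liner
    keywords: list[str] = field(default_factory=list)
    questions: list[str] = field(default_factory=list)  # see next entry

chunk = Chunk(text="Claims must be filed within 30 days.",
              source="policy.pdf", page=12, section="Claims > Deadlines")
```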
Hypothetical questions at ingest
For each chunk, use an LLM to generate 3–5 questions it would answer. Store them as metadata. At query time, match against both the chunk text and the pre-generated questions; this often boosts recall substantially, since user queries phrased as questions match question-form metadata more closely than prose.
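A sketch of the generation step, assuming the OpenAI SDK; the prompt wording and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def hypothetical_questions(chunk_text: str, n: int = 4) -> list[str]:
    prompt = (f"Write {n} distinct questions that the following passage answers. "
              f"One per line, no numbering.\n\n{chunk_text}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip() for q in lines if q.strip()]
```

Store the result on the chunk (e.g. the `questions` field sketched above) and embed it alongside the chunk text.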
Multimodal ingestion
Tables → text description or JSON. Images → vision LLM generates a caption (stored as searchable text). Audio → Whisper transcription → chunked text. All reduce to embeddable text.
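A sketch of the audio branch, assuming the `openai-whisper` package (the file name is hypothetical):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("all_hands_recording.mp3")
text = result["text"]  # plain text from here on: chunk and embed as usual
```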
Knowledge graph ingestion
Extract entities and relationships from text. Store as nodes + edges (not flat chunks). Enables multi-hop queries that flat retrieval cannot handle.
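A sketch of the storage side using `networkx`; the triples are hard-coded here for illustration, where an LLM extractor would emit them:

```python
import networkx as nx

g = nx.DiGraph()
# (subject, relation, object) triples extracted from text
for subj, rel, obj in [("Alice", "manages", "Bob"), ("Bob", "works_on", "Project X")]:
    g.add_edge(subj, obj, relation=rel)

# A two-hop traversal that flat chunk retrieval cannot express:
for report in g.successors("Alice"):
    print(report, "->", list(g.successors(report)))  # Bob -> ['Project X']
```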
Ingestion pipeline
Load → clean → chunk → enrich → embed → store. Run incrementally using file hashing to avoid re-processing unchanged documents.
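A sketch of the incremental-skip step using SHA-256 file hashes (the state-file name and glob pattern are illustrative):

```python
import hashlib, json, pathlib

STATE = pathlib.Path("ingest_hashes.json")  # hypothetical state file

def changed_files(folder: str) -> list[pathlib.Path]:
    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    todo = []
    for path in sorted(pathlib.Path(folder).glob("**/*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen.get(str(path)) != digest:  # new or modified since the last run
            todo.append(path)
            seen[str(path)] = digest       # in production, record only after a successful ingest
    STATE.write_text(json.dumps(seen))
    return todo
```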
Retrieval — finding the right chunks at query time
Dense retrieval (vector search)
Embed the query, then search for the nearest document vectors by cosine similarity. Finds semantic matches even when the exact words differ ("deadline" matches "time limit"). Requires a vector DB.
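A brute-force sketch of the scoring step in NumPy; a vector DB does the same thing behind an approximate-nearest-neighbour index:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Exact cosine search over a (num_docs, dim) matrix of document vectors."""
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = docs @ query
    return np.argsort(-scores)[:k].tolist()
```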
Sparse retrieval (BM25)
Keyword matching that scores by term frequency weighted by term rarity (inverse document frequency). Excellent for exact terms (form names, product codes, jargon). Misses synonyms and paraphrases.
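A sketch using the `rank_bm25` package; the corpus and form code are invented for illustration, and the whitespace tokenisation is deliberately naive:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = ["submit form TX-204 before the deadline",
          "appeals must be lodged within 30 days"]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

scores = bm25.get_scores("tx-204 deadline".split())
print(scores)  # exact terms like "tx-204" score high; synonyms score zero
```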
Hybrid retrieval
Run dense and sparse in parallel, then fuse the result lists with Reciprocal Rank Fusion (RRF, defined below). Consistently outperforms either method alone; a document that ranks highly in both lists gets the highest fused score.
Reranking
After retrieval, a cross-encoder model re-scores the top 20 candidates by reading the query and document together. Expensive, but much more accurate than embedding similarity. Pattern: retrieve 20 → rerank → keep 5.
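A sketch using a cross-encoder from `sentence-transformers` (the model name is one common public checkpoint, not a prescription):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # The cross-encoder reads query and document together, unlike the
    # separate embeddings used at retrieval time.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:keep]]
```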
RRF (Reciprocal Rank Fusion)
Fusion formula: score = 1/(k + rank) summed across lists. A result ranked #1 in both dense and sparse wins. k=60 is the standard constant. Simple but robust.
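A direct implementation of the formula in plain Python; this is the fusion step the Hybrid retrieval entry refers to:

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """score(d) = sum over lists of 1 / (k + rank_of_d_in_that_list)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf([["a", "b", "c"], ["b", "a", "d"]]))
# ['a', 'b', 'c', 'd'] -- "a" and "b" each rank top-2 in both lists and tie for first
```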
Query rewriting
Use an LLM to make the user's original query more explicit and keyword-rich before searching. "What happens after I send the form?" might become "What are the processing steps and expected timeline after a form is submitted?"
Multi-query expansion
LLM generates 3–5 phrasings of the same query. Retrieve for each, merge and deduplicate. Ensures wording variations do not cause misses.
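A sketch assuming the OpenAI SDK; `search` stands in for whatever retriever you already have:

```python
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 4) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Rewrite this search query {n} different ways, one per line:\n{query}"}],
    )
    rewrites = [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    return [query] + rewrites

def multi_query_retrieve(query: str, search) -> list[str]:
    seen, merged = set(), []
    for q in expand_query(query):
        for doc_id in search(q):      # retrieve for every phrasing
            if doc_id not in seen:    # merge and deduplicate
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```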
Sub-question decomposition
Break complex queries into simpler sub-questions. Retrieve for each separately. Combine retrieved chunks. LLM synthesises one unified answer from all context.
Metadata filtering
Apply hard filters (division, date, status) before or during vector search: the equivalent of a SQL WHERE clause on a vector query. Dramatically narrows the search space.
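One way to express this, using Chroma (field names and values are illustrative):

```python
import chromadb

col = chromadb.Client().create_collection("policies")
col.add(
    ids=["c1", "c2"],
    documents=["Travel claims need receipts.", "HR onboarding checklist."],
    metadatas=[{"division": "finance"}, {"division": "hr"}],
)

# Hard filter first; the vector search then runs only over the survivors.
hits = col.query(query_texts=["expense rules"], n_results=3,
                 where={"division": "finance"})
```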
Contextual compression
After retrieval, extract only the sentences from each chunk that are relevant to the query. Strips noise, saves context window, gives the LLM cleaner input.
Parent-child retrieval
Small child chunks (128 tokens) for precise embedding match. When matched, return the larger parent chunk (512 tokens) to the LLM for fuller context. Best of both: precision + context.
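A minimal sketch of the child-to-parent lookup in plain Python; `search_children` stands in for your vector search over the small chunks:

```python
# Index small children for matching; return their parents for context.
parents = {"p1": "the full 512-token section text"}
children = [
    {"id": "c1", "parent": "p1", "text": "128-token slice one"},
    {"id": "c2", "parent": "p1", "text": "128-token slice two"},
]

def retrieve_parents(query: str, search_children) -> list[str]:
    matched = search_children(query)             # precise match on small chunks
    parent_ids = {c["parent"] for c in matched}  # deduplicate shared parents
    return [parents[pid] for pid in parent_ids]  # hand the LLM the fuller context
```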
Agentic / ReAct retrieval
The agent loops through Thought → Action (search) → Observation, then decides its next step. Each retrieval informs the next. Handles multi-part questions requiring sequential lookups.
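A minimal sketch of the loop, assuming the OpenAI SDK and a placeholder `search` function; real agent frameworks add tool schemas and stricter parsing:

```python
from openai import OpenAI

client = OpenAI()

def react_answer(question: str, search, max_steps: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       transcript + "\nReply with 'SEARCH: <query>' "
                                    "or 'ANSWER: <final answer>'."}],
        )
        step = resp.choices[0].message.content.strip()
        if step.startswith("ANSWER:"):                   # agent decides it is done
            return step.removeprefix("ANSWER:").strip()
        query = step.removeprefix("SEARCH:").strip()     # Thought -> Action
        transcript += f"\n{step}\nObservation: {search(query)}"  # Observe, then loop
    return "No answer within the step budget."
```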
Text-to-SQL
LLM reads the DB schema and generates SQL from natural language. Executes against live database. Risks: wrong SQL returns wrong numbers silently. Mitigate with read-only connections and few-shot examples.
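A sketch of the read-only pattern with SQLite, assuming the OpenAI SDK; the schema, database file, and model name are illustrative:

```python
import sqlite3
from openai import OpenAI

client = OpenAI()
SCHEMA = "CREATE TABLE orders(id INTEGER, region TEXT, total REAL);"

def ask(question: str) -> list[tuple]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Schema:\n{SCHEMA}\n\nWrite one SQLite SELECT statement, "
                   f"no prose and no code fences, answering: {question}"}],
    )
    sql = resp.choices[0].message.content.strip()
    # mode=ro: even badly generated SQL cannot modify the data.
    with sqlite3.connect("file:app.db?mode=ro", uri=True) as conn:
        return conn.execute(sql).fetchall()
```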
Advanced techniques
HyDE
Hypothetical Document Embeddings: an LLM generates a hypothetical answer to the query, and that answer is embedded instead of the raw query. The fake answer uses "document vocabulary", so it sits closer in vector space to real answers.
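A sketch assuming the OpenAI SDK; the model names are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def hyde_vector(query: str) -> list[float]:
    # Step 1: generate a plausible (possibly wrong) answer in document style.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Write a short paragraph that plausibly answers: {query}"}],
    )
    fake_answer = resp.choices[0].message.content
    # Step 2: embed the fake answer instead of the raw query.
    emb = client.embeddings.create(model="text-embedding-3-small", input=[fake_answer])
    return emb.data[0].embedding
```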
Step-back prompting
For very specific queries, first retrieve at a higher abstraction level. Specific question about one edge case → first retrieve the general policy section it belongs to.
RAPTOR
Recursive clustering + summarisation builds a tree from leaf chunks up to high-level summaries. Broad questions hit summary nodes. Specific questions hit leaf chunks. Solves multi-document synthesis.
GraphRAG
Builds a knowledge graph + community summaries at index time using an LLM. Global queries synthesise across summaries. Local queries traverse the graph. Best for corpus-wide thematic questions.
Self-RAG / CRAG
Self-RAG: model emits reflection tokens (is this relevant? is this grounded?). CRAG: evaluator scores retrieval quality — if low, falls back to web search. Both make RAG self-correcting.
Lost-in-the-middle
LLMs tend to under-weight content placed in the middle of a long context. Fix: put the most relevant chunks at the very start and very end of the prompt; the least relevant go in the middle.
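A sketch of the reordering in plain Python, taking chunks already sorted best-first:

```python
def reorder_for_llm(chunks_best_first: list[str]) -> list[str]:
    """Alternate best-remaining chunks between the front and the back,
    so the weakest material ends up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_llm(["1st", "2nd", "3rd", "4th", "5th"]))
# ['1st', '3rd', '5th', '4th', '2nd'] -- best first, runner-up last, weakest centred
```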
Evaluation
Faithfulness
Are all claims in the answer supported by the retrieved context? Catches hallucination. RAGAS scores this via an LLM judge.
Context recall
Were all necessary chunks retrieved? Low recall = the retriever missed relevant documents. Fix by adjusting chunk size, embedding model, or adding hybrid search.
Context precision
Of the chunks retrieved, how many were actually relevant? Low precision = noise is being fed to the LLM. Fix with better filtering or reranking.
MRR (Mean Reciprocal Rank)
How high does the first correct result appear in the list? First result at rank 1 = MRR 1.0. First result at rank 4 = MRR 0.25. Higher is better.
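A worked sketch in plain Python (None marks a query where nothing correct was retrieved):

```python
def mrr(first_correct_ranks: list) -> float:
    """Mean of 1/rank across queries; None counts as zero."""
    return sum(1.0 / r if r else 0.0 for r in first_correct_ranks) / len(first_correct_ranks)

print(mrr([1, 4, None]))  # (1.0 + 0.25 + 0.0) / 3 ≈ 0.417
```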