
Agentic AI ingestion + retrieval — quick reference

Ingestion — loading & preparing data
Document loaders
Read raw files (PDF, DOCX, web, Confluence, SQL) and produce plain text + metadata. Every pipeline starts here. OCR handles scanned documents.
Chunking
Split documents into focused pieces. Five strategies: fixed-size (with overlap), recursive (paragraph → sentence → word), semantic (topic-change boundaries), structure-aware (headings), proposition-based (LLM rewrites to standalone facts).
Chunk overlap
Repeat 10–20% of the previous chunk at the start of the next. Prevents losing context when a sentence is split across a boundary.
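A minimal sketch of the fixed-size strategy from the Chunking entry, with the overlap described here, in plain Python (character-based for brevity; the sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` characters of the previous one (here ~15%)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Token-based splitters work the same way, with a tokeniser in place of character slicing.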
Embeddings
Convert text to a fixed-length vector of numbers. Similar meanings → similar vectors. Must use the same model for both documents and queries. Common: text-embedding-3-small, voyage-3, bge-large.
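A minimal sketch using the OpenAI Python SDK (assumes `openai` and `numpy` are installed and OPENAI_API_KEY is set; any embedding model works the same way):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # The same model must embed both documents and queries.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vec, query_vec = embed(["The deadline is Friday.", "What is the time limit?"])
similarity = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
print(similarity)  # higher cosine similarity = closer meaning
```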
Metadata enrichment
Attach structured fields to each chunk: source file, page, section heading, LLM-generated summary, keywords, entities, and hypothetical questions the chunk answers. Enables filtering at query time.
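A sketch of a chunk record carrying such fields (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str                  # originating file
    page: int                    # page number in the source
    section: str                 # nearest heading
    summary: str = ""            # LLM-generated one-liner
    keywords: list[str] = field(default_factory=list)
    questions: list[str] = field(default_factory=list)  # see next entry

chunk = Chunk(text="Claims must be filed within 30 days.",
              source="policy.pdf", page=12, section="Claims > Deadlines")
```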
Hypothetical questions at ingest
For each chunk, use an LLM to generate 3–5 questions it would answer. Store them as metadata. At query time, match against both the chunk text and the pre-generated questions; this often boosts recall substantially, since user queries phrased as questions match question-form metadata more closely than prose.
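A sketch of the generation step, assuming the OpenAI SDK; the prompt wording and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def hypothetical_questions(chunk_text: str, n: int = 4) -> list[str]:
    prompt = (f"Write {n} distinct questions that the following passage answers. "
              f"One per line, no numbering.\n\n{chunk_text}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip() for q in lines if q.strip()]
```

Store the result on the chunk (e.g. the `questions` field sketched above) and embed it alongside the chunk text.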
Multimodal ingestion
Tables → text description or JSON. Images → vision LLM generates a caption (stored as searchable text). Audio → Whisper transcription → chunked text. All reduce to embeddable text.
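A sketch of the audio branch, assuming the `openai-whisper` package (the file name is hypothetical):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("all_hands_recording.mp3")
text = result["text"]  # plain text from here on: chunk and embed as usual
```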
Knowledge graph ingestion
Extract entities and relationships from text. Store as nodes + edges (not flat chunks). Enables multi-hop queries that flat retrieval cannot handle.
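A sketch of the storage side using `networkx`; the triples are hard-coded here for illustration, where an LLM extractor would emit them:

```python
import networkx as nx

g = nx.DiGraph()
# (subject, relation, object) triples extracted from text
for subj, rel, obj in [("Alice", "manages", "Bob"), ("Bob", "works_on", "Project X")]:
    g.add_edge(subj, obj, relation=rel)

# A two-hop traversal that flat chunk retrieval cannot express:
for report in g.successors("Alice"):
    print(report, "->", list(g.successors(report)))  # Bob -> ['Project X']
```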
Ingestion pipeline
Load → clean → chunk → enrich → embed → store. Run incrementally using file hashing to avoid re-processing unchanged documents.
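A sketch of the incremental-skip step using SHA-256 file hashes (the state-file name and glob pattern are illustrative):

```python
import hashlib, json, pathlib

STATE = pathlib.Path("ingest_hashes.json")  # hypothetical state file

def changed_files(folder: str) -> list[pathlib.Path]:
    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    todo = []
    for path in sorted(pathlib.Path(folder).glob("**/*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen.get(str(path)) != digest:  # new or modified since the last run
            todo.append(path)
            seen[str(path)] = digest       # in production, record only after a successful ingest
    STATE.write_text(json.dumps(seen))
    return todo
```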
Retrieval — finding the right chunks at query time
Dense retrieval (vector search)
Embed the query, then search for the nearest document vectors by cosine similarity. Finds semantic matches even when the exact words differ ("deadline" matches "time limit"). Requires a vector DB.
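A brute-force sketch of the scoring step in NumPy; a vector DB does the same thing behind an approximate-nearest-neighbour index:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Exact cosine search over a (num_docs, dim) matrix of document vectors."""
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = docs @ query
    return np.argsort(-scores)[:k].tolist()
```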
Sparse retrieval (BM25)
Keyword matching that scores by term frequency weighted by term rarity (inverse document frequency). Excellent for exact terms (form names, product codes, jargon). Misses synonyms and paraphrases.
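A sketch using the `rank_bm25` package; the corpus and form code are invented for illustration, and the whitespace tokenisation is deliberately naive:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = ["submit form TX-204 before the deadline",
          "appeals must be lodged within 30 days"]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

scores = bm25.get_scores("tx-204 deadline".split())
print(scores)  # exact terms like "tx-204" score high; synonyms score zero
```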
Hybrid retrieval
Run dense and sparse in parallel, then fuse the result lists with Reciprocal Rank Fusion (RRF, defined below). Consistently outperforms either method alone; a document that ranks highly in both lists gets the highest fused score.
Reranking
After retrieval, a cross-encoder model re-scores the top 20 candidates by reading the query and document together. Expensive, but much more accurate than embedding similarity. Pattern: retrieve 20 → rerank → keep 5.
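A sketch using a cross-encoder from `sentence-transformers` (the model name is one common public checkpoint, not a prescription):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # The cross-encoder reads query and document together, unlike the
    # separate embeddings used at retrieval time.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:keep]]
```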
RRF (Reciprocal Rank Fusion)
Fusion formula: score = 1/(k + rank) summed across lists. A result ranked #1 in both dense and sparse wins. k=60 is the standard constant. Simple but robust.
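A direct implementation of the formula in plain Python; this is the fusion step the Hybrid retrieval entry refers to:

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """score(d) = sum over lists of 1 / (k + rank_of_d_in_that_list)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf([["a", "b", "c"], ["b", "a", "d"]]))
# ['a', 'b', 'c', 'd'] -- "a" and "b" each rank top-2 in both lists and tie for first
```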
Query rewriting
Use an LLM to make the user's original query more explicit and keyword-rich before searching. "What happens after I send the form?" might become "What are the processing steps and expected timeline after a form is submitted?"
Multi-query expansion
LLM generates 3–5 phrasings of the same query. Retrieve for each, merge and deduplicate. Ensures wording variations do not cause misses.
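A sketch assuming the OpenAI SDK; `search` stands in for whatever retriever you already have:

```python
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 4) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Rewrite this search query {n} different ways, one per line:\n{query}"}],
    )
    rewrites = [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    return [query] + rewrites

def multi_query_retrieve(query: str, search) -> list[str]:
    seen, merged = set(), []
    for q in expand_query(query):
        for doc_id in search(q):      # retrieve for every phrasing
            if doc_id not in seen:    # merge and deduplicate
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```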
Sub-question decomposition
Break complex queries into simpler sub-questions. Retrieve for each separately. Combine retrieved chunks. LLM synthesises one unified answer from all context.
Metadata filtering
Apply hard filters (division, date, status) before or during vector search: the equivalent of a SQL WHERE clause on a vector query. Dramatically narrows the search space.
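One way to express this, using Chroma (field names and values are illustrative):

```python
import chromadb

col = chromadb.Client().create_collection("policies")
col.add(
    ids=["c1", "c2"],
    documents=["Travel claims need receipts.", "HR onboarding checklist."],
    metadatas=[{"division": "finance"}, {"division": "hr"}],
)

# Hard filter first; the vector search then runs only over the survivors.
hits = col.query(query_texts=["expense rules"], n_results=3,
                 where={"division": "finance"})
```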
Contextual compression
After retrieval, extract only the sentences from each chunk that are relevant to the query. Strips noise, saves context window, gives the LLM cleaner input.
Parent-child retrieval
Small child chunks (128 tokens) for precise embedding match. When matched, return the larger parent chunk (512 tokens) to the LLM for fuller context. Best of both: precision + context.
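A minimal sketch of the child-to-parent lookup in plain Python; `search_children` stands in for your vector search over the small chunks:

```python
# Index small children for matching; return their parents for context.
parents = {"p1": "the full 512-token section text"}
children = [
    {"id": "c1", "parent": "p1", "text": "128-token slice one"},
    {"id": "c2", "parent": "p1", "text": "128-token slice two"},
]

def retrieve_parents(query: str, search_children) -> list[str]:
    matched = search_children(query)             # precise match on small chunks
    parent_ids = {c["parent"] for c in matched}  # deduplicate shared parents
    return [parents[pid] for pid in parent_ids]  # hand the LLM the fuller context
```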
Agentic / ReAct retrieval
The agent loops through Thought → Action (search) → Observation, then decides its next step. Each retrieval informs the next. Handles multi-part questions requiring sequential lookups.
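A minimal sketch of the loop, assuming the OpenAI SDK and a placeholder `search` function; real agent frameworks add tool schemas and stricter parsing:

```python
from openai import OpenAI

client = OpenAI()

def react_answer(question: str, search, max_steps: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       transcript + "\nReply with 'SEARCH: <query>' "
                                    "or 'ANSWER: <final answer>'."}],
        )
        step = resp.choices[0].message.content.strip()
        if step.startswith("ANSWER:"):                   # agent decides it is done
            return step.removeprefix("ANSWER:").strip()
        query = step.removeprefix("SEARCH:").strip()     # Thought -> Action
        transcript += f"\n{step}\nObservation: {search(query)}"  # Observe, then loop
    return "No answer within the step budget."
```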
Text-to-SQL
LLM reads the DB schema and generates SQL from natural language. Executes against live database. Risks: wrong SQL returns wrong numbers silently. Mitigate with read-only connections and few-shot examples.
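A sketch of the read-only pattern with SQLite, assuming the OpenAI SDK; the schema, database file, and model name are illustrative:

```python
import sqlite3
from openai import OpenAI

client = OpenAI()
SCHEMA = "CREATE TABLE orders(id INTEGER, region TEXT, total REAL);"

def ask(question: str) -> list[tuple]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Schema:\n{SCHEMA}\n\nWrite one SQLite SELECT statement, "
                   f"no prose and no code fences, answering: {question}"}],
    )
    sql = resp.choices[0].message.content.strip()
    # mode=ro: even badly generated SQL cannot modify the data.
    with sqlite3.connect("file:app.db?mode=ro", uri=True) as conn:
        return conn.execute(sql).fetchall()
```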
Advanced techniques
HyDE
Hypothetical Document Embeddings: an LLM generates a hypothetical answer to the query, and that answer is embedded instead of the raw query. The fake answer uses "document vocabulary", so it sits closer in vector space to real answers.
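A sketch assuming the OpenAI SDK; the model names are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def hyde_vector(query: str) -> list[float]:
    # Step 1: generate a plausible (possibly wrong) answer in document style.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Write a short paragraph that plausibly answers: {query}"}],
    )
    fake_answer = resp.choices[0].message.content
    # Step 2: embed the fake answer instead of the raw query.
    emb = client.embeddings.create(model="text-embedding-3-small", input=[fake_answer])
    return emb.data[0].embedding
```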
Step-back prompting
For very specific queries, first retrieve at a higher abstraction level. Specific question about one edge case → first retrieve the general policy section it belongs to.
RAPTOR
Recursive clustering + summarisation builds a tree from leaf chunks up to high-level summaries. Broad questions hit summary nodes. Specific questions hit leaf chunks. Solves multi-document synthesis.
GraphRAG
Builds a knowledge graph + community summaries at index time using an LLM. Global queries synthesise across summaries. Local queries traverse the graph. Best for corpus-wide thematic questions.
Self-RAG / CRAG
Self-RAG: model emits reflection tokens (is this relevant? is this grounded?). CRAG: evaluator scores retrieval quality — if low, falls back to web search. Both make RAG self-correcting.
Lost-in-the-middle
LLMs tend to under-weight content placed in the middle of a long context. Fix: put the most relevant chunks at the very start and very end of the prompt; the least relevant go in the middle.
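A sketch of the reordering in plain Python, taking chunks already sorted best-first:

```python
def reorder_for_llm(chunks_best_first: list[str]) -> list[str]:
    """Alternate best-remaining chunks between the front and the back,
    so the weakest material ends up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_llm(["1st", "2nd", "3rd", "4th", "5th"]))
# ['1st', '3rd', '5th', '4th', '2nd'] -- best first, runner-up last, weakest centred
```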
Evaluation
Faithfulness
Are all claims in the answer supported by the retrieved context? Catches hallucination. RAGAS scores this via an LLM judge.
Context recall
Were all necessary chunks retrieved? Low recall = the retriever missed relevant documents. Fix by adjusting chunk size, embedding model, or adding hybrid search.
Context precision
Of the chunks retrieved, how many were actually relevant? Low precision = noise is being fed to the LLM. Fix with better filtering or reranking.
MRR (Mean Reciprocal Rank)
How high does the first correct result appear in the list? First result at rank 1 = MRR 1.0. First result at rank 4 = MRR 0.25. Higher is better.
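A worked sketch in plain Python (None marks a query where nothing correct was retrieved):

```python
def mrr(first_correct_ranks: list) -> float:
    """Mean of 1/rank across queries; None counts as zero."""
    return sum(1.0 / r if r else 0.0 for r in first_correct_ranks) / len(first_correct_ranks)

print(mrr([1, 4, None]))  # (1.0 + 0.25 + 0.0) / 3 ≈ 0.417
```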