
Chunking in Agentic AI

Strategies, tradeoffs & enterprise patterns — as of March 2026

What chunking is — the foundational decision
Chunking: splitting documents into retrievable units before embedding and indexing
LLMs have a fixed context window · embedding models work best on focused, short passages · retrieval precision improves when chunks are topically coherent
Foundation
↓ five primary strategies
Strategy 1 — fixed-size chunking (baseline)
Fixed-size chunking
split every N tokens/chars regardless of content
Simple · Fast
Overlap window
repeat 10–20% of previous chunk at start of next
Context bridge
Size selection
128–256 tokens (Q&A) · 512 (general) · 1024 (summarisation)
Size guide
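A minimal sketch of the fixed-size baseline, assuming tiktoken for token-accurate sizing; the 512/64 defaults follow the size guide above.

```python
# Fixed-size chunking with overlap — minimal sketch, assuming tiktoken.
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    stride = chunk_size - overlap              # step = size minus the context bridge
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```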
Strategy 2 — recursive character splitting (most common default)
Recursive character splitter
tries paragraph → sentence → word boundaries in order
LangChain default
Separator hierarchy
\n\n → \n → " " → "" by default · ". " commonly added for sentence breaks · always takes the deepest semantic break that fits
Priority order
Why it wins at baseline
respects natural language structure · no LLM cost · deterministic
Rationale
Strategy 3 — document structure-aware chunking
Structure-aware splitting
headings, sections, paragraphs as natural boundaries
Layout-aware
Markdown / HTML splitter
splits on # headings · preserves header hierarchy in metadata
Format-specific
Code-aware splitter
AST-based · splits on function/class boundaries · tree-sitter
Code docs
Strategy 4 — semantic chunking (topic-boundary detection)
Semantic chunking
embed sentences · measure cosine similarity · cut where similarity drops
Topic-aware
Threshold tuning
percentile-based or fixed cosine drop · controls chunk count
Hyperparameter
Cost consideration
requires embedding every sentence at ingest · 10–30× more expensive
Trade-off
Strategy 5 — proposition-based chunking (LLM-powered, 2024–2026)
Proposition chunking
LLM rewrites each paragraph into standalone atomic facts
SOTA quality
Before vs after
"it can be submitted by…" → "Form A can be submitted by the customer."
No ambiguity
Enterprise adoption
used for high-value doc corpora · compliance · legal · medical
Production 2025+
Parent-child (small-to-big) chunking — most widely adopted enterprise pattern 2025
Parent-child chunking: small child chunks for retrieval precision, large parent returned to LLM
child: 128 tokens for embedding match · parent: 512–1024 tokens for LLM context · indexed separately · child stores parent_id reference
Dual-index pattern
RAPTOR hierarchical indexing — recursive summarisation tree
RAPTOR leaf nodes
original chunks at level 0 — precise facts
Level 0
Cluster summaries
LLM summarises each embedding cluster (UMAP reduction + GMM) → level 1 nodes
Level 1
Root summaries
recursive summarisation → corpus-level abstractions
Level 2+
Query routing
broad questions → high level · specific → leaf nodes
Dual path
Late chunking — embed then chunk (2024 innovation)
Late chunking: embed the full document first, then pool token embeddings into chunk vectors
preserves full document context in every chunk embedding · resolves coreference ("it", "they", "the team") · requires long-context embedding model (jina-embeddings-v3, voyage-3) · Jina AI 2024
Context-aware embeddings
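A sketch of the pooling mechanics, assuming Hugging Face transformers; the model name is a short-context stand-in for brevity — real late chunking needs a long-context embedding model such as jina-embeddings-v3.

```python
# Late chunking sketch: run full-document attention once, then mean-pool
# token embeddings per chunk span. Model is a short-context stand-in;
# production late chunking requires a long-context embedding model.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk(document: str, spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """spans = (start, end) chunk boundaries in token positions."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]   # (seq_len, dim)
    # Pooling happens *after* document-wide attention, so "it"/"they" in a
    # chunk are embedded with their referents in view.
    return [token_embs[s:e].mean(dim=0) for s, e in spans]
```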
Contextual retrieval — Anthropic 2024, adopted at scale 2025
Contextual retrieval: prepend LLM-generated context to each chunk before embedding
Claude reads full doc + chunk → generates 1–2 sentence context → prepended to chunk text before embedding → Anthropic reports 35% fewer retrieval failures with contextual embeddings · 49% with contextual BM25 added · 67% with reranking on top
Anthropic 2024
Sliding window + sentence-window retrieval
Sentence-window chunking
index single sentences · retrieve surrounding window of ±3 sentences
High precision
Sliding window with stride
chunk size 512 · stride 256 · 50% overlap between consecutive chunks
Dense coverage
Retrieval vs index size
more overlap = better recall · larger index · higher storage cost
Trade-off
Agentic / self-reflective chunking — frontier 2025–2026
Agentic chunking: LLM decides where to cut based on document semantics and downstream task
agent reads document → identifies entity boundaries, argument structure, claim-evidence pairs → proposes chunk boundaries → validated against embedding coherence score · used in document intelligence platforms (Reducto, LlamaParse Pro, Azure Document Intelligence 2025)
LLM-directed · 2025
Source metadata — automatically extracted at ingest
Source metadata
file name · page · type · created · last modified · author
Auto-extract
Structural metadata
section heading · parent heading · chapter · depth level
Document structure
Position metadata
chunk index · total chunks · byte offset · token start/end
Navigation
LLM-generated semantic metadata — the retrieval multiplier
LLM-generated metadata: summary, keywords, entities, hypothetical questions, topic, audience
generated at ingest time per chunk · stored alongside vector · enables pre-filtering and semantic routing · expensive but dramatically improves retrieval F1 · used by Pinecone, Weaviate, LlamaIndex as standard enterprise pattern
Semantic enrichment
Hypothetical questions metadata — dense retrieval booster
Hypothetical questions at ingest
LLM generates 3–5 questions the chunk would answer · stored as searchable metadata
HyDE variant
Why this works
user queries match pre-generated questions better than raw chunk text · closes vocabulary gap
Recall boost
Cost model
~$0.002 per chunk at Haiku pricing · amortised over query lifetime · cache in metadata store
Economics
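A sketch of ingest-time question generation with the Anthropic SDK; the model id and prompt wording are assumptions, not a fixed recipe.

```python
# Hypothetical-questions enrichment at ingest — sketch; model id and
# prompt wording are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def hypothetical_questions(chunk_text: str, n: int = 3) -> list[str]:
    msg = client.messages.create(
        model="claude-3-5-haiku-20241022",   # cheap model keeps per-chunk cost low
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Write {n} short questions this passage answers, "
                       f"one per line, no numbering:\n\n{chunk_text}",
        }],
    )
    # Store alongside the chunk's vector as searchable metadata.
    return [q.strip() for q in msg.content[0].text.splitlines() if q.strip()]
```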
Metadata filtering — narrows vector search before ANN
Pre-filter by metadata
division · date range · document status · audience · classification level
WHERE clause for vectors
Performance impact
filtering out 90% of the corpus before ANN can mean up to ~10× faster search on the same hardware
Latency win
Payload indexing
Qdrant payload index · Weaviate property index · Pinecone metadata index
DB feature
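A filtered-search sketch with qdrant-client; the field names, values, and placeholder query vector are illustrative, and the classic search call is used.

```python
# Metadata pre-filtering sketch with qdrant-client — field names and the
# placeholder query vector are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Payload index makes the filter cheap to evaluate before ANN search.
client.create_payload_index(
    collection_name="chunks", field_name="division", field_schema="keyword"
)

query_embedding = [0.0] * 768   # placeholder — use your real query embedding

hits = client.search(
    collection_name="chunks",
    query_vector=query_embedding,
    query_filter=models.Filter(must=[
        models.FieldCondition(key="division", match=models.MatchValue(value="EMEA")),
        models.FieldCondition(key="status", match=models.MatchValue(value="published")),
    ]),
    limit=5,
)
```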
Document lineage — essential for compliance and audit
Chunk lineage tracking
chunk_id → parent_doc_id → source_system → ingestion_run_id · stored in Postgres
Audit trail
Chunk versioning
document updated → re-chunk → new chunk IDs · old chunks soft-deleted · version pointer
Change management
PII / sensitivity tagging
Microsoft Presidio or AWS Comprehend scans chunk text at ingest · tags stored in metadata
Data governance
Document type routing — different strategies per source type
Enterprise routing: classify document type → apply strategy → route to appropriate chunker
PDF policies → structure-aware + parent-child · contracts → proposition chunking · code repos → AST splitter · spreadsheets → row/cell chunking · transcripts → speaker-turn chunking · emails → thread chunking
Type-aware pipeline
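A toy routing sketch; the chunkers are stubs standing in for the strategy-specific implementations described above.

```python
# Type-aware routing sketch — the chunkers are toy stand-ins for the
# strategy-specific implementations described above.
from typing import Callable

def chunk_recursive(text: str) -> list[str]:          # fallback default
    return [p for p in text.split("\n\n") if p.strip()]

def chunk_speaker_turns(text: str) -> list[str]:      # transcripts
    return [ln for ln in text.splitlines() if ":" in ln]

CHUNKERS: dict[str, Callable[[str], list[str]]] = {
    "transcript": chunk_speaker_turns,
    # "contract": proposition chunker, "code": AST splitter, ... (see nodes above)
}

def route_and_chunk(text: str, doc_type: str) -> list[str]:
    return CHUNKERS.get(doc_type, chunk_recursive)(text)
```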
Multimodal chunking — beyond text (enterprise standard 2025)
Table chunking
Unstructured.io or Camelot · convert to markdown or JSON + text description
Structured data
Image / figure chunking
vision LLM (GPT-4o / Claude) generates caption → stored as embeddable text chunk
Vision pipeline
Audio / video chunking
Whisper transcription (+ speaker diarisation) → speaker-turn or time-window chunks with timestamps
Transcription
Slide deck chunking
per-slide chunks · slide title + body + speaker notes + extracted image captions
Presentation
Enterprise tooling stack — what is actually deployed at scale (March 2026)
Unstructured.io
de facto enterprise standard for complex PDF / HTML extraction and layout-aware chunking
Extraction
LlamaParse Pro
cloud API · handles complex PDFs with tables, headers, multi-column layouts · LlamaIndex native
Cloud parser
Reducto
document intelligence SaaS · agentic chunking · used in fintech and legal enterprise 2025
AI-native parser
Azure Document Intelligence
prebuilt models for invoices, contracts, forms · semantic chunking integrated 2025
Azure
Cloud-native chunking services — managed options per cloud
AWS Bedrock Knowledge Bases
fixed · hierarchical (parent-child) · semantic chunking · built into Bedrock ingestion pipeline
AWS managed
GCP Vertex AI Search
layout-based chunking · document AI pre-processing · grounding-native chunk metadata
GCP managed
Azure AI Search chunking
text split skill · sentence-boundary aware · integrated document cracking + OCR
Azure managed
Regulated industries — compliance-specific chunking requirements
Regulated industry requirements: clause-level chunking, citation preservation, PII isolation, audit trail
financial services: section-level with FINRA/SEC citation metadata · healthcare: HIPAA PHI isolation at chunk boundary · legal: clause chunking with Bluebook citation preservation · government: classification-level metadata per chunk · all require chunk lineage stored independently of vector store
Compliance
The fundamental tension — precision vs recall vs cost
Smaller chunks = higher retrieval precision but lower context per chunk · larger chunks = more context but noisier embeddings
the chunk size is the single most impactful hyperparameter in a RAG system · wrong chunk size is the top cause of RAG failure in production · there is no universal optimal size — it depends on query type, domain, and LLM context window
Core tension
Chunk size selection guide by use case
Factual Q&A
128–256 tokens · precise single-fact retrieval · medical, legal, policy
Small
General document Q&A
256–512 tokens · paragraph-level coherence · most enterprise RAG
Medium
Summarisation tasks
512–1024 tokens · richer context per chunk · report generation
Large
Code retrieval
whole function or class · AST-boundary split · never mid-function
Semantic unit
Overlap strategy — preventing context loss at boundaries
Why overlap is essential
without overlap: "it" at chunk start has no referent · answer splits across boundary · retrieval misses
Problem
Overlap sizing rules
10% overlap = minimal context bridge · 20% = standard · 50% = sentence-window pattern
Sizing
Storage cost of overlap
20% overlap = stride of 80% of chunk size = ~25% more chunks, vectors and storage cost
Economics
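The arithmetic as a two-line check: with overlap fraction f the stride shrinks to chunk_size × (1 − f), so the index grows by roughly 1 / (1 − f).

```python
# Chunk-count arithmetic: overlap fraction f shrinks the stride, so the
# index grows by ~1 / (1 - f), not by f itself.
import math

def n_chunks(n_tokens: int, chunk_size: int, overlap_frac: float) -> int:
    stride = int(chunk_size * (1 - overlap_frac))
    return max(1, math.ceil((n_tokens - chunk_size) / stride) + 1)

print(n_chunks(100_000, 512, 0.0))   # 196 chunks
print(n_chunks(100_000, 512, 0.2))   # 245 chunks — ~25% more vectors
```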
Strategy selection matrix
Use fixed-size when
homogeneous docs · speed priority · prototyping · baseline to beat
Choose if
Use recursive when
mixed doc types · default production choice · unknown domain structure
Choose if
Use semantic when
topic shifts are sharp · high retrieval precision required · budget allows ingest cost
Choose if
Use proposition when
high-value corpus · compliance · answer quality is business-critical · cost justified
Choose if
Evaluation — how to know your chunking is working
RAGAS evaluation
context recall · context precision · faithfulness · answer relevancy — compared per chunking strategy
Framework
Golden dataset
20–50 hand-labelled query → expected chunk pairs · used to A/B test strategies
Ground truth
Chunk utilisation rate
% of retrieved chunks actually used by LLM in final answer · low = noisy retrieval
Metric
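A sketch against ragas' classic API (field names follow its expected schema; newer versions differ); the golden row is a made-up example.

```python
# RAGAS evaluation sketch — classic API; the golden row is illustrative.
# These metrics call an LLM judge, so an API key must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

golden = Dataset.from_dict({
    "question":     ["What is the meal reimbursement limit?"],
    "contexts":     [["Employees may claim up to $75 per day for meals."]],
    "ground_truth": ["The limit is $75 per day for meals."],
    "answer":       ["$75 per day."],
})

scores = evaluate(golden, metrics=[context_precision, context_recall])
print(scores)   # re-run whenever chunk strategy, size, or overlap changes
```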
Recursive character splitting — production default (Python)
LangChain RecursiveCharacterTextSplitter — most common production baseline
chunk_size · chunk_overlap · separator hierarchy · length_function for token-accurate sizing
LangChain · Python
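A baseline sketch; the separator list is LangChain's default plus an explicit sentence break, and sizing is token-accurate via the tiktoken helper.

```python
# Recursive splitting, token-accurate — the production baseline.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",                    # token-accurate length function
    chunk_size=512,
    chunk_overlap=64,                               # ~12% context bridge
    separators=["\n\n", "\n", ". ", " ", ""],       # defaults + sentence break
)
document_text = "Paragraphs of extracted document text...\n\nMore text."  # placeholder
chunks = splitter.split_text(document_text)
```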
Semantic chunking — topic-boundary detection (Python)
LangChain SemanticChunker — embed sentences, cut on cosine drop
breakpoint_threshold_type: percentile / standard_deviation / gradient · requires embedding model at ingest
LangChain · Python
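A sketch assuming langchain_experimental and an OpenAI embedding model; swap in whatever embedder you use at query time.

```python
# Semantic chunking sketch — SemanticChunker lives in langchain_experimental;
# embedding model choice is an assumption.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",   # or standard_deviation / gradient
    breakpoint_threshold_amount=95,           # cut at the sharpest 5% of drops
)
document_text = "Topic one sentences...\n\nTopic two sentences..."  # placeholder
chunks = chunker.split_text(document_text)    # embeds every sentence at ingest
```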
Parent-child dual index — enterprise retrieval pattern (Python)
LangChain ParentDocumentRetriever — small chunks for matching, large parent returned to LLM
child_splitter 128 tokens · parent_splitter 512 tokens · child stores parent_id · retrieval fetches parent on match
LangChain · Python
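A sketch of the dual-index pattern with LangChain's ParentDocumentRetriever; the in-memory stores and character-based sizes (approximating the token guidance above) are illustrative.

```python
# Parent-child (small-to-big) sketch — children are embedded and indexed,
# parents are what the LLM receives. Stores and sizes are illustrative.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

retriever = ParentDocumentRetriever(
    vectorstore=InMemoryVectorStore(embedding=OpenAIEmbeddings()),
    docstore=InMemoryStore(),                                       # holds parents
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),  # chars ≈ 128 tokens
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000) # chars ≈ 512 tokens
)
retriever.add_documents([Document(page_content="Full policy document text...")])
parents = retriever.invoke("What is the reimbursement limit?")  # parents, not children
```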
Proposition chunking — LLM-powered atomic facts (Python)
Custom proposition chunker using Claude / GPT-4o — rewrites paragraphs into self-contained facts
batch paragraphs → LLM returns JSON list of propositions → each proposition embedded independently
Custom · Python
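A custom sketch with the Anthropic SDK; the model id, prompt, and bare-JSON contract are assumptions — batch and add retries in production.

```python
# Proposition chunking sketch — rewrites a paragraph into standalone facts.
# Model id, prompt, and the bare-JSON assumption are all illustrative.
import json
import anthropic

client = anthropic.Anthropic()

def to_propositions(paragraph: str) -> list[str]:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",   # assumed model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Rewrite this paragraph as a JSON array of standalone, "
                       "self-contained factual statements. Resolve every pronoun "
                       "to its referent. Return only the JSON array.\n\n" + paragraph,
        }],
    )
    return json.loads(msg.content[0].text)  # each proposition is embedded on its own
```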
Contextual retrieval — Anthropic pattern (Python)
Contextual retrieval: prepend LLM context to each chunk before embedding
for each chunk: call Claude with (full_doc, chunk) → get 2-sentence context → prepend to chunk → embed the enriched chunk
Anthropic pattern · Python
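A sketch following the published pattern; the prompt is paraphrased from Anthropic's recipe, and prompt-caching the full document across chunks is what keeps this affordable.

```python
# Contextual retrieval sketch — prepend generated context, then embed the
# enriched text. Prompt wording is paraphrased, not Anthropic's verbatim recipe.
import anthropic

client = anthropic.Anthropic()

def contextualize(full_doc: str, chunk: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"<document>\n{full_doc}\n</document>\n"
                       f"<chunk>\n{chunk}\n</chunk>\n"
                       "Give a 1-2 sentence context situating this chunk within "
                       "the document. Answer with only the context.",
        }],
    )
    return msg.content[0].text.strip() + "\n\n" + chunk   # embed this, not the raw chunk
```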
Language-aware code splitter — structure-aware code chunking (Python)
LangChain RecursiveCharacterTextSplitter.from_language — splits on per-language function/class separators
Language.PYTHON · Language.JS · Language.GO · static separator lists rather than a tree-sitter AST parse · prefers function/class boundaries but can still split an oversized function
LangChain · Python
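A sketch of the language-aware splitter; chunk_size is in characters here, and the source path is illustrative.

```python
# Language-aware code splitting sketch — from_language selects a Python-specific
# separator list (class/def boundaries). Path and sizes are illustrative.
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,          # characters; keeps whole functions where they fit
    chunk_overlap=0,
)
source = open("app.py").read()                 # illustrative path
code_chunks = code_splitter.split_text(source)
```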


Chunking is not a solved problem

As of March 2026, there is no universally correct chunking strategy. The optimal approach depends on document type, query distribution, embedding model, and LLM context window. Every production system should A/B test at least two strategies against a golden dataset before committing.

The chunk size is the most important hyperparameter

Wrong chunk size is the single most common cause of RAG failure in production. Too small: chunks lose surrounding context, embeddings are noisy, and multi-sentence answers fragment across boundaries. Too large: embeddings average over too many topics, retrieval is imprecise, and the LLM receives too much irrelevant context.

Parent-child is the enterprise default in 2025–2026

The parent-child (small-to-big) pattern is the most widely deployed enterprise chunking architecture. Small child chunks (128 tokens) give precise embedding matches. Large parent chunks (512–1024 tokens) give the LLM sufficient context to reason. AWS Bedrock, Azure AI Search, and LlamaIndex all support this natively.

Metadata is as important as the chunk text

A chunk without rich metadata is only half as useful. The combination of LLM-generated summaries, keywords, hypothetical questions, and structural metadata turns retrieval from approximate similarity search into targeted knowledge retrieval. Every enterprise deployment should invest in metadata enrichment at ingest time.

Evaluation closes the loop

Build a golden dataset of 20–50 query-to-expected-chunk pairs before deploying. Run RAGAS context recall and context precision metrics against it whenever you change chunking strategy, chunk size, or overlap. Without this, you are guessing.