Chunking in Agentic Systems
How you split documents into retrievable pieces is the single most overlooked decision in a RAG pipeline. It determines whether your system finds the right information or confidently returns the wrong answer.
What is chunking, precisely?
Chunking is the process of splitting a document into smaller, independently retrievable pieces before indexing them in a vector database. It is not an optional optimisation — it is a structural requirement. LLMs have finite context windows. Embedding models produce meaningfully worse representations for very long texts. And retrieval systems must return the specific passage that answers a question, not a 200-page document.
At its simplest: you have a 40,000-word policy document. A user asks "what is the deadline to dispute a foreign transaction?" You cannot embed the entire document and compare it to that question — the signal is too diluted. You need to embed a focused 300-word passage about dispute deadlines, so that passage sits close to the question in vector space and gets retrieved.
Wrong chunk size is the most common reason a RAG system returns correct-sounding but wrong answers. The information was in the corpus. The retriever just never saw it because it was split at the wrong boundary.
Production RAG engineering observation: the output of chunking is a list of text objects, each with the original text, a vector embedding, and metadata (source file, page, section, date, and increasingly LLM-generated enrichments). Every downstream component (retrieval, reranking, generation) depends on the quality of these chunks.
What a chunk contains
A well-engineered chunk is not just a text slice. It carries:
Chunk text
The raw passage — typically 128 to 1,024 tokens. This is what gets embedded and returned to the LLM.
Vector embedding
A fixed-length float array (768–3072 dimensions) representing the chunk's semantic meaning in vector space.
Source metadata
File name, page number, section heading, document type, created/modified dates — enables hard filtering at retrieval time.
Enrichment metadata
LLM-generated summaries, keywords, hypothetical questions, detected entities — dramatically improves retrieval recall.
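Taken together, a single ingested chunk can be represented as a plain record. A minimal sketch (field names are illustrative, not any particular framework's schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str                                       # the raw 128–1,024-token passage
    embedding: list[float]                          # e.g. 768–3,072 floats from the embedding model
    source: dict = field(default_factory=dict)      # file name, page, section heading, dates
    enrichment: dict = field(default_factory=dict)  # LLM summary, keywords, hypothetical questions
```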
Why chunking dominates RAG quality
In production RAG systems, the most common failure modes trace back to chunking decisions, not to the LLM or the embedding model. The RAGAS evaluation framework consistently shows that context recall (whether the right chunks were retrieved) and context precision (whether retrieved chunks were relevant) are the primary drivers of end-to-end answer quality — and both are directly determined by how documents were chunked.
The fundamental size tradeoff
Every chunking decision is a negotiation between two competing properties: retrieval precision (small chunks match queries more exactly) and answer completeness (large chunks give the LLM enough context to reason correctly). You cannot fully maximise both simultaneously with a single chunk size.
Illustrative values. Actual scores depend on document type, query style, and embedding model.
The industry consensus as of 2025 is that 256–512 tokens is the default sweet spot for general document Q&A. However, the best production systems do not use a single fixed size — they use hierarchical strategies like parent-child indexing (retrieve small, provide large) or RAPTOR (retrieve at the right level of abstraction) to resolve the tradeoff entirely.
| Chunk size | Best for | Risk | Enterprise usage |
|---|---|---|---|
| 64–128 tokens | Proposition-level facts, FAQ retrieval | Context loss, ambiguous pronouns | Niche — proposition-based RAG |
| 256–512 tokens | General policy docs, knowledge bases | May split mid-argument | Default — most enterprise RAG deployments |
| 512–1024 tokens | Legal, technical, narrative docs | Precision dilution | Common — legal/compliance RAG |
| Full section | Long-form synthesis tasks | Poor retrieval precision | Rare — parent tier only |
Fixed-size chunking with overlap
The simplest strategy: split every document at a fixed character or token count, regardless of content. Because hard splits at arbitrary positions destroy sentence and paragraph continuity, you add overlap — repeating the last N tokens of a chunk at the start of the next.
Overlap prevents the most catastrophic split failures: a sentence that straddles a boundary no longer sits isolated at the end of chunk 1, because the start of chunk 2 repeats it. This continuity means retrieval almost always surfaces the split sentence together with its surrounding context. The cost is storage: with 20% overlap each chunk advances only 80% of its length, so total chunk count (and vector DB storage) grows by roughly 25%.
Recommended parameters
Typical enterprise configuration: chunk_size=512 tokens, chunk_overlap=50–100 tokens (approximately 10–20%). Larger overlap (up to 25%) is appropriate for dense technical documents where individual sentences are tightly interdependent. LangChain's RecursiveCharacterTextSplitter is the standard implementation.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("policy.pdf")
docs = loader.load()  # returns list[Document] with page metadata

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # measured by length_function (characters here)
    chunk_overlap=64,     # ~12.5% overlap
    length_function=len,  # swap in a tiktoken-based length for token counts
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in priority order
    add_start_index=True, # adds character offset to metadata
)
chunks = splitter.split_documents(docs)
# Each chunk: Document(page_content="...",
#   metadata={"source": ..., "page": ..., "start_index": ...})
```
Recursive character splitting
The most widely used strategy in enterprise RAG as of 2025. Rather than splitting blindly at a character count, the recursive splitter tries a hierarchy of natural separators in priority order: double newline (paragraph break), single newline, sentence end, word boundary. It only falls through to the next separator if a chunk would still exceed the size limit after applying the current one.
The effect is that the splitter follows the document's own structure as closely as possible. A 400-token paragraph stays together. A 600-token paragraph is split at its nearest sentence boundary, not at character 512. The result is far more coherent chunks than fixed-size splitting, with almost no additional complexity.
The separator hierarchy
For plain text documents: "\n\n" → "\n" → ". " → " " → "". For Markdown: headings (##, #) are added at the front. For HTML: block elements (<p>, <div>) drive the hierarchy. For code: function and class boundaries take priority.
Semantic chunking
Semantic chunking replaces fixed boundaries with topic-change detection. Rather than splitting at character counts, it splits where the content actually changes topic — detected by measuring semantic similarity between adjacent sentences.
How it works
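Conceptually: embed every sentence, measure the similarity between each adjacent pair, and start a new chunk wherever similarity drops sharply. A minimal from-scratch sketch (not the LangChain implementation; `embed` is assumed to return unit-normalised vectors, and the threshold value is illustrative):

```python
import numpy as np

def semantic_chunk(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Split where adjacent-sentence similarity drops below `threshold`.

    `embed` is any function mapping list[str] -> a 2-D array of unit-normalised vectors.
    """
    vectors = np.asarray(embed(sentences))
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(vectors[i - 1], vectors[i]))  # cosine on unit vectors
        if similarity < threshold:  # sharp topic change: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```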
The result is chunks that align with actual topical boundaries rather than arbitrary character counts. A section about "Form A submission requirements" becomes one chunk. A section about "resolution timelines" becomes another. The topic change, not a character count, defines the boundary.
Strengths and limitations
| Property | Semantic chunking | Recursive splitting |
|---|---|---|
| Boundary quality | Topic-aligned | Paragraph-aligned |
| Ingestion cost | High (embed every sentence) | Near-zero |
| Chunk size consistency | Variable (some very large) | Bounded by config |
| Short documents | Works, overkill | Ideal |
| Long, topic-rich documents | Excellent | Good |
| Enterprise adoption (2025) | Growing | Dominant |
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="gradient",  # or "percentile", "standard_deviation"
    breakpoint_threshold_amount=0.3,       # tune on your corpus
    min_chunk_size=100,                    # prevent tiny slivers
)
chunks = chunker.create_documents([text])
```
Structure-aware chunking
Structure-aware chunking uses the document's own organisational elements — headings, sections, HTML tags, Markdown hierarchy, XML elements — as chunk boundaries. The idea is that human authors already divided the document into meaningful units. The chunking strategy should respect those decisions rather than override them with character counts.
Heading-based chunking
For Markdown and HTML documents, every heading creates a potential chunk boundary. A chunk contains the heading and all content until the next heading at the same or higher level. LlamaIndex's MarkdownNodeParser and LangChain's MarkdownTextSplitter implement this natively.
A critical enhancement: store the full heading path in metadata. A chunk under "Section 4.2 — Foreign Disputes" should carry that path, not just the immediate heading. This path becomes a filter at retrieval time: "search only within Section 4 documents" is a valid metadata filter that dramatically reduces noise.
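One way to capture that path with LangChain is MarkdownHeaderTextSplitter (a different splitter from the MarkdownTextSplitter named above), which attaches each configured heading level to chunk metadata; the metadata keys below are chosen by the caller:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")]
)
chunks = splitter.split_text(markdown_text)
# Each chunk carries its heading path, usable as a retrieval-time filter, e.g.:
# Document(page_content="...",
#   metadata={"section": "Section 4", "subsection": "4.2 Foreign Disputes"})
```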
Unstructured.io for complex PDFs
Plain PDF extraction loses all structural information — headers, tables, and body text become an undifferentiated stream of characters. Unstructured.io uses machine learning to detect and classify layout elements (Title, NarrativeText, Table, ListItem, Image) before chunking. In production enterprise pipelines processing mixed-format documents, Unstructured.io is effectively the standard.
The dominant enterprise ingestion pattern for complex document corpora in 2025.
Agentic and proposition-based chunking
Introduced by Chen et al. (2023) in the Dense X Retrieval paper, proposition-based chunking uses an LLM at ingestion time to rewrite every paragraph into a set of atomic, self-contained factual statements — called propositions. Each proposition is a complete fact that can be understood without any surrounding context.
The goal of proposition-based chunking is to eliminate the ambiguity that plagues standard chunk boundaries. Every stored unit of text should be independently interpretable.
Dense X Retrieval, Chen et al. 2023

The transformation in practice
Original: "It can be submitted by the customer or their representative. The form requires documentation of the charge in question, and it must be filed within 60 days."

Propositions:
1. Form A can be submitted by the customer.
2. Form A can be submitted by the customer's representative.
3. Form A requires documentation of the charge in question.
4. Form A must be filed within 60 days of the charge.
Notice what changed: the pronoun "it" in the original becomes the explicit subject "Form A" in every proposition. "Within 60 days" gains the context "of the charge" that was implied in the original. Each proposition is now unambiguous and self-contained. An embedding of proposition 4 sits extremely close to the query "dispute form deadline" in vector space — far closer than the original ambiguous paragraph would.
Cost and enterprise adoption
Proposition chunking is expensive — every paragraph requires an LLM call. For a 10,000-document corpus, this can mean 100,000+ LLM calls at ingestion time. This cost is why it remains a niche technique: it is used in high-stakes precision environments (financial compliance, medical documentation, legal research) where retrieval accuracy justifies the expense. Most enterprise RAG systems use recursive or structural chunking for the bulk of their corpus and apply proposition-based chunking selectively to the highest-value documents.
```python
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def extract_propositions(paragraph: str) -> list[str]:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # a small model is sufficient for this task
        messages=[{
            "role": "user",
            "content": f"""Extract all factual propositions from the paragraph below.
Each proposition must be:
- A single, complete sentence
- Self-contained (no pronouns that refer to prior context)
- A factual statement, not a question or command

Paragraph: {paragraph}

Return a JSON object with a single key "propositions" containing an array of strings."""
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["propositions"]
```
Late chunking
Late chunking, introduced by Jina AI in 2024, inverts the conventional order of operations. In standard RAG, you chunk first, then embed each chunk independently. In late chunking, you embed the entire document first, then chunk the resulting token embeddings.
Why this matters: the context loss problem
When you embed a chunk independently, the embedding model has no context beyond the chunk text. The sentence "The deadline is 60 days" embedded in isolation has a weaker, more ambiguous representation than the same sentence embedded with surrounding context ("This section covers foreign transaction dispute deadlines. The deadline is 60 days. Exceptions apply for fraud cases."). The embedding of the isolated chunk misses the contextual signal that surrounds it in the document.
Late chunking solves this by passing the full document through the embedding model, capturing cross-chunk contextual information in the attention layers. After the full forward pass, the resulting token-level embeddings are pooled by chunk boundary. Each chunk gets an embedding that was computed with awareness of the full document context.
Standard order: split into chunks first, then embed each chunk independently. Late chunking order: run the full document through the embedding model first, then pool the token-level embeddings by chunk boundary.
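A minimal sketch of the pooling step, assuming a long-context encoder that exposes token-level hidden states (the model name is illustrative, and Jina's production implementation differs in detail):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "BAAI/bge-m3"  # illustrative long-context embedding model (8,192-token window)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk(document: str, boundaries: list[tuple[int, int]]) -> list[torch.Tensor]:
    """`boundaries` are (start, end) token offsets for each chunk within the document."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Mean-pool per chunk: each chunk vector was computed with full-document attention
    return [token_embeddings[start:end].mean(dim=0) for start, end in boundaries]
```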
The limitation: the full document must fit in the embedding model's context window. Models like jina-embeddings-v3 (8,192 tokens) and voyage-3 (32,000 tokens) support longer contexts. Very long documents must still be pre-split into sections before late chunking is applied. Enterprise adoption is early — Jina AI's own benchmarks show 2–5% improvement in retrieval recall on long documents, which is meaningful at production scale.
Parent-child indexing
Parent-child indexing is the most widely adopted advanced chunking pattern in enterprise RAG as of 2025. It resolves the precision-versus-completeness tradeoff structurally, rather than trying to find a single ideal chunk size. The insight is that retrieval and generation have different requirements: retrieval wants small, precise chunks; generation wants large, contextually complete chunks.
The two-tier structure
Child chunks (retrieval tier)
Small chunks of 64–128 tokens. Embedded and stored in the vector database. Used for precise embedding matching at query time. Their small size means their vector is a sharp, focused representation of a narrow topic.
Parent chunks (generation tier)
Larger chunks of 512–1024 tokens. Stored in a document store (not necessarily a vector DB). When a child chunk is matched by retrieval, its parent is fetched and provided to the LLM. The LLM receives rich context.
The mechanics: each child chunk stores a reference to its parent's ID. At retrieval time, the system finds matching child chunks (precise, semantic match), then fetches their parent chunks (contextually complete). The LLM never sees the small child chunks — it sees the larger parents. Best of both worlds.
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# Build the multi-tier node structure
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 256, 64]  # parent → child → grandchild
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)  # only leaf nodes are embedded

# All nodes (parents and leaves) go in the docstore so parents can be fetched later
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# AutoMergingRetriever: returns the parent when enough of its children are retrieved
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    simple_ratio_thresh=0.4,  # if 40%+ of a parent's children match, return the parent
)
```
LangChain's ParentDocumentRetriever provides the same pattern with a different API. Both are widely used in enterprise deployments. The storage architecture typically combines a vector database (Weaviate, Qdrant, Pinecone) for child embeddings with a document store (Redis, MongoDB, or a simple in-memory store for smaller corpora) for parent chunks.
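A compact version of the same pattern with LangChain's ParentDocumentRetriever (the vector store and docstore choices here are illustrative, as is sizing chunks by characters):

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="children", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # swap for Redis/MongoDB in production

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),    # retrieval tier
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),  # generation tier
)
retriever.add_documents(documents)              # embeds children, stores parents
matches = retriever.invoke("dispute deadline")  # returns parent documents
```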
RAPTOR hierarchical indexing
RAPTOR (Recursive Abstractive Processing for Tree-Organised Retrieval), published by Sarthi et al. at Stanford in 2024, extends the parent-child idea into a full tree structure built by recursive clustering and LLM summarisation. It specifically addresses queries that require synthesising information across many different document sections — a task where both flat chunk retrieval and parent-child indexing struggle.
Building the RAPTOR tree
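The construction loop: embed the leaf chunks, cluster them, summarise each cluster with an LLM, then treat the summaries as the next layer and repeat until few nodes remain. A minimal sketch of that loop, with simplifications (the paper uses UMAP dimensionality reduction and soft Gaussian-mixture clustering; `embed`, `summarise`, and the cluster-count heuristic here are placeholders):

```python
from sklearn.mixture import GaussianMixture

def build_raptor_tree(chunks: list[str], embed, summarise, max_levels: int = 3) -> list[list[str]]:
    """Return levels of the tree: level 0 is the original chunks, higher levels are summaries."""
    levels, current = [chunks], chunks
    for _ in range(max_levels):
        if len(current) <= 1:
            break
        vectors = embed(current)                 # (n, dim) array of chunk embeddings
        n_clusters = max(1, len(current) // 5)   # illustrative heuristic
        labels = GaussianMixture(n_components=n_clusters).fit_predict(vectors)
        summaries = []
        for c in range(n_clusters):
            members = [text for text, label in zip(current, labels) if label == c]
            if members:
                summaries.append(summarise(members))  # one LLM summary per cluster
        levels.append(summaries)
        current = summaries
    # At query time, all levels are embedded and searched together ("collapsed tree")
    return levels
```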
Enterprise adoption as of 2025 is growing but selective. Teams typically use RAPTOR for their highest-value, highest-query-volume document collections — a bank's full regulatory document archive, a tech company's entire product documentation corpus — while using simpler strategies for lower-value data.
Multimodal chunking
Modern enterprise documents are not plain text. A single PDF may contain narrative text, tables, charts, diagrams, and photographs. Standard text splitters discard or mangle all non-text content. Multimodal chunking handles each content type appropriately.
Tables
Tables in PDFs are notoriously destructive when extracted naively — columns merge, rows split, and the tabular structure is lost entirely. Three approaches are used in production:
Text description
An LLM (GPT-4o, Claude) reads the table structure and generates a natural language description. Embeddable as text. Loses precision for large tables with many cells.
JSON serialisation
Convert table to JSON with row/column structure. Store JSON as text chunk alongside a brief text description. Enables both semantic and exact lookup.
Structured DB storage
Extract table into a relational store. Use Text-to-SQL retrieval rather than vector retrieval for the table. Best for large, frequently queried numerical tables.
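A sketch of the JSON serialisation approach: keep the exact cells as JSON for lookup and pair them with a short description for semantic matching (the row format and helper name are illustrative):

```python
import json

def table_to_chunk(rows: list[dict], caption: str) -> dict:
    """rows: e.g. [{"fee_type": "foreign transaction", "rate": "2.5%"}, ...]"""
    description = f"Table: {caption}. Columns: {', '.join(rows[0].keys())}. {len(rows)} rows."
    return {
        "text": description + "\n" + json.dumps(rows, ensure_ascii=False),  # embedded as one chunk
        "metadata": {"content_type": "table", "caption": caption},
    }
```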
Images and figures
Images require a vision LLM to generate searchable text. At ingestion time, every image is passed to a multimodal model (GPT-4o, Claude Sonnet, Gemini Pro Vision) with a prompt like "Describe this figure in detail for retrieval purposes. Include all text, labels, trends, and key findings." The generated description is stored as a text chunk with metadata linking it to the source image.
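A sketch of that ingestion step using the OpenAI vision API with the prompt described above (the helper function and image format are illustrative):

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_figure(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this figure in detail for retrieval purposes. "
                                         "Include all text, labels, trends, and key findings."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # stored as a text chunk linked to the source image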
Audio and video
Audio is transcribed with Whisper or AWS Transcribe, then treated as a long text document. The transcript is chunked by speaker turn or time window (e.g. every 3 minutes of speech). Timestamps and speaker labels are stored as metadata, enabling retrieval of "what did speaker X say about topic Y at around 00:15:00".
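A sketch of the time-window grouping, assuming the transcription step already produced ordered segments with a start time, speaker label, and text (the segment format is illustrative):

```python
def chunk_transcript(segments: list[dict], window_seconds: int = 180) -> list[dict]:
    """segments: [{"start": 12.4, "speaker": "spk_1", "text": "..."}, ...] in time order."""
    def flush(buffer: list[dict], window_start: float) -> dict:
        return {
            "text": " ".join(s["text"] for s in buffer),
            "metadata": {
                "start_seconds": window_start,
                "speakers": sorted({s["speaker"] for s in buffer}),
            },
        }

    chunks, current, window_start = [], [], 0.0
    for seg in segments:
        if current and seg["start"] - window_start >= window_seconds:
            chunks.append(flush(current, window_start))  # close the 3-minute window
            current, window_start = [], seg["start"]
        current.append(seg)
    if current:
        chunks.append(flush(current, window_start))
    return chunks
```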
Code-aware chunking
Code has a fundamentally different structure than prose. Splitting a Python file at 512 characters may split a function in the middle of its docstring, or between a class definition and its methods. Code-aware chunking uses the Abstract Syntax Tree (AST) to identify semantically meaningful boundaries.
The principle
The natural unit of code is the function or method. A code-aware chunker uses tree-sitter or the built-in ast module to parse the source file and extract each function, class, or module as a separate chunk. The chunk includes the docstring, signature, and body. Related functions can be grouped by class membership.
LangChain's Language.PYTHON splitter and LlamaIndex's CodeSplitter (backed by tree-sitter) both provide this. GitHub's Copilot RAG pipeline and enterprise code search tools (Sourcegraph Cody, AWS CodeWhisperer) all use AST-aware chunking internally.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

# Language-specific separators respect function and class boundaries
py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,   # functions can be large, so allow more room per chunk
    chunk_overlap=0,   # function boundaries are hard stops; no overlap needed
)
# Separators used: ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]
chunks = py_splitter.create_documents([python_source_code])
```
What enterprises actually use in 2025
Based on publicly documented deployments, OSS project adoption data, and engineering blog post evidence as of early 2025, the enterprise chunking landscape has converged on a tiered approach rather than any single strategy.
The teams with the best RAG performance are not using the most sophisticated chunking strategy. They are using recursive splitting with good metadata enrichment, and testing obsessively with RAGAS.
Common pattern across production RAG deployments: recursive character splitting (512 tokens / 64 overlap) with Unstructured.io for complex PDFs. LLM-generated metadata enrichment (summary + keywords per chunk). Parent-child indexing for high-value document collections. Weaviate or Qdrant as the vector store.
Bedrock Knowledge Bases supports fixed, semantic, and hierarchical chunking (as of re:Invent 2024). Hierarchical chunking implements parent-child indexing natively — parent chunks stored in S3, child chunks in OpenSearch Serverless. Semantic chunking uses a small model to compute sentence-level similarity. The managed service abstracts infrastructure but limits customisation.
Azure AI Search's integrated vectorisation feature (GA in 2024) handles chunking, embedding, and indexing in a single skillset pipeline. The Document Intelligence layout skill extracts structure from PDFs. The Split skill implements recursive splitting. The AzureOpenAIEmbedding skill calls text-embedding-3-large. The entire pipeline runs serverless within Azure.
Vertex AI Search (formerly Enterprise Search) provides managed RAG with configurable chunking. Vertex AI RAG Engine (released 2024) gives fine-grained control over chunking strategies, embedding models, and vector stores. Supports 1,024-token default chunks with layout-aware PDF parsing.
Benchmarks and sizing guidelines
Chunk size recommendations cannot be universal — they depend on document type, query style, and embedding model. These are the empirically validated guidelines used across multiple production deployments as of 2025.
| Document type | Recommended chunk size | Overlap | Strategy | Rationale |
|---|---|---|---|---|
| Corporate policy / procedure | 256–512 tokens | 10–15% | Recursive | Paragraph-level answers; headings preserve context |
| Legal contracts / compliance | 512–1024 tokens | 15–20% | Structural + recursive fallback | Clauses are interdependent; small chunks lose legal meaning |
| Technical documentation | 256–512 tokens | 10% | Structural (heading-based) | Section structure is meaningful; headings are queries |
| FAQ / support articles | 128–256 tokens | 0–10% | Fixed-size or proposition | Each Q&A is independent; tight precision needed |
| Research papers | 512 tokens | 15% | Section-aware + RAPTOR | Abstract/intro/conclusion need separate treatment |
| Source code | Function/class | 0 | AST-aware | Function boundaries are natural semantic units |
| Earnings calls / transcripts | 256 tokens | 20% | Speaker-turn aware | Speaker changes are semantic boundaries |
| Product catalogue | 1 product = 1 chunk | 0 | Entity-based | Product is the atomic retrieval unit |
Cloud-native chunking tools
LangChain text splitters
LangChain provides the widest variety of text splitters in any framework. RecursiveCharacterTextSplitter is the default and handles 90% of use cases. Language-specific splitters support Python, JS, TypeScript, Markdown, HTML, LaTeX, and more. SemanticChunker (experimental) provides embeddings-based boundary detection.
Best for: teams that want maximum control and composability. LangChain splitters integrate directly into LCEL chains.
LlamaIndex node parsers
LlamaIndex calls chunks "nodes" and provides parsers rather than splitters. The framework was built specifically for RAG, so its chunking primitives are more RAG-native than LangChain's. HierarchicalNodeParser natively builds parent-child trees. SentenceWindowNodeParser is a specialised small-window approach.
Best for: teams optimising specifically for RAG quality rather than general LLM pipelines. LlamaIndex provides more built-in evaluation tooling.
Unstructured.io
The enterprise standard for ingesting mixed-format document corpora. Uses ML-based layout detection to classify PDF elements before chunking. The hosted API (unstructured.io/api) and self-hosted Docker container are both used in production. The chunking_strategy="by_title" option is particularly powerful — it groups content under the same heading into a single chunk.
Best for: corpora with complex PDFs, mixed formats, scanned documents, or tables that matter.
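A sketch of the by_title flow with the open-source library (parameter values are illustrative):

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="annual_report.pdf", strategy="hi_res")  # ML layout detection
chunks = chunk_by_title(
    elements,
    max_characters=2000,             # hard ceiling per chunk
    combine_text_under_n_chars=200,  # merge very small sections forward
)
# Each chunk groups the elements that fall under one detected heading
```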
Haystack (deepset)
Haystack 2.0 treats chunking as a pipeline component. The DocumentSplitter node is composable with other pipeline steps. Haystack has particularly strong support for hybrid retrieval pipelines that combine chunked vector search with BM25. It is the framework of choice for several European enterprise deployments.
Best for: teams building complete, production-grade pipelines where chunking is one component among many including custom preprocessing, evaluation, and deployment.
Production ingestion pipeline
A complete production chunking pipeline that handles mixed document types, runs enrichment, and writes to a vector store. This is the pattern used in enterprise deployments.
```python
import asyncio

import weaviate
from weaviate.classes.data import DataObject
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import AsyncOpenAI
from unstructured.partition.auto import partition

client = AsyncOpenAI()
vector_db = weaviate.connect_to_local()
collection = vector_db.collections.get("Chunks")  # assumes the collection already exists

# ── Step 1: Extract with Unstructured.io ──
def extract_document(path: str) -> list[str]:
    elements = partition(filename=path, strategy="hi_res")
    # Group elements into sections, starting a new section at each detected Title
    sections, current = [], []
    for el in elements:
        if el.category == "Title" and current:
            sections.append("\n".join(current))
            current = []
        current.append(str(el))
    if current:
        sections.append("\n".join(current))
    return sections

# ── Step 2: Recursive split each section ──
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=64, separators=["\n\n", "\n", ". ", " "]
)

# ── Step 3: Async enrichment + embedding ──
async def enrich_chunk(text: str, metadata: dict) -> dict:
    summary_task = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarise in 1 sentence: {text}"}],
    )
    embed_task = client.embeddings.create(input=text, model="text-embedding-3-small")
    summary_resp, embed_resp = await asyncio.gather(summary_task, embed_task)
    return {
        "text": text,
        "vector": embed_resp.data[0].embedding,
        "summary": summary_resp.choices[0].message.content,
        **metadata,
    }

# ── Step 4: Ingest into Weaviate ──
async def ingest_file(path: str):
    sections = extract_document(path)
    for section in sections:
        raw_chunks = splitter.split_text(section)
        tasks = [
            enrich_chunk(c, {"source": path, "section": section[:80]})
            for c in raw_chunks
        ]
        enriched = await asyncio.gather(*tasks)
        # Batch upsert: the vector is stored separately from the text/metadata properties
        collection.data.insert_many([
            DataObject(
                properties={k: v for k, v in item.items() if k != "vector"},
                vector=item["vector"],
            )
            for item in enriched
        ])
```
Evaluating chunk quality
You cannot judge chunk quality by looking at chunks. You judge it by measuring retrieval quality on a set of questions with known correct answers. The RAGAS framework is the standard tool. The four metrics most directly influenced by chunking are:
| Metric | What it measures | Chunking influence |
|---|---|---|
| Context recall | Were all necessary chunks retrieved? (requires ground truth) | Very high — wrong chunk size = missed retrieval |
| Context precision | Of retrieved chunks, how many were relevant? | High — overly large chunks reduce precision |
| Faithfulness | Are LLM claims grounded in retrieved context? | Moderate — complete context enables faithful answers |
| Answer relevancy | Does the answer address the question? | Moderate — downstream of retrieval quality |
```python
from ragas import evaluate
from ragas.metrics import context_recall, context_precision, faithfulness

# eval_dataset: HuggingFace Dataset with columns:
#   question, answer, contexts (list[str]), ground_truth
results = evaluate(
    dataset=eval_dataset,
    metrics=[context_recall, context_precision, faithfulness],
    llm=evaluation_llm,
    embeddings=evaluation_embeddings,
)

# Interpret results:
#   context_recall < 0.7    → chunks too small, or splitting at the wrong boundary
#   context_precision < 0.6 → chunks too large, too much noise retrieved
#   faithfulness < 0.8      → LLM not grounded; may indicate a context quality issue
print(results)
```