Agentic AI · RAG Engineering · Enterprise Architecture

Chunking in Agentic Systems

The single most overlooked decision in a RAG pipeline — how you split documents into retrievable pieces determines whether your system finds the right information or confidently returns the wrong answer.

7 core strategies · Fixed-size through proposition-based
3 advanced techniques · RAPTOR, late chunking, parent-child
Enterprise patterns · What teams use at scale in 2025
Depth level · Senior engineer / architect
01 — Foundations

What is chunking, precisely?

Chunking is the process of splitting a document into smaller, independently retrievable pieces before indexing them in a vector database. It is not an optional optimisation — it is a structural requirement. LLMs have finite context windows. Embedding models produce meaningfully worse representations for very long texts. And retrieval systems must return the specific passage that answers a question, not a 200-page document.

At its simplest: you have a 40,000-word policy document. A user asks "what is the deadline to dispute a foreign transaction?" You cannot embed the entire document and compare it to that question — the signal is too diluted. You need to embed a focused 300-word passage about dispute deadlines, so that passage sits close to the question in vector space and gets retrieved.

Wrong chunk size is the most common reason a RAG system returns correct-sounding but wrong answers. The information was in the corpus. The retriever just never saw it because it was split at the wrong boundary.

Production RAG engineering observation

The output of chunking is a list of text objects, each with the original text, a vector embedding, and metadata (source file, page, section, date, and increasingly LLM-generated enrichments). Every downstream component — retrieval, reranking, generation — depends on the quality of these chunks.

What a chunk contains

A well-engineered chunk is not just a text slice. It carries:

Chunk text

The raw passage — typically 128 to 1,024 tokens. This is what gets embedded and returned to the LLM.

Vector embedding

A fixed-length float array (768–3072 dimensions) representing the chunk's semantic meaning in vector space.

Source metadata

File name, page number, section heading, document type, created/modified dates — enables hard filtering at retrieval time.

Enrichment metadata

LLM-generated summaries, keywords, hypothetical questions, detected entities — dramatically improves retrieval recall.
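
As a concrete sketch, the record that a chunking pipeline passes around might look like the following. The field names are illustrative, not any particular framework's schema:

Python · Sketch · A chunk record (illustrative)
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str                        # the raw passage, ~128–1,024 tokens
    embedding: list[float]           # fixed-length vector, e.g. 768–3,072 dims
    source: str                      # file name or URI
    page: int | None = None          # page number, if the format has pages
    section: str | None = None       # heading path, e.g. "4 / 4.2 Foreign Disputes"
    enrichments: dict = field(default_factory=dict)  # LLM summary, keywords, questions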

02 — Foundations

Why chunking dominates RAG quality

In production RAG systems, the most common failure modes trace back to chunking decisions, not to the LLM or the embedding model. The RAGAS evaluation framework consistently shows that context recall (whether the right chunks were retrieved) and context precision (whether retrieved chunks were relevant) are the primary drivers of end-to-end answer quality — and both are directly determined by how documents were chunked.

The three failure modes

1. Split across the answer boundary. The answer to the question exists in the document, but the key sentence is at the end of chunk 7 and the supporting context is at the start of chunk 8. Neither chunk alone is sufficient. Both may score low in retrieval because each is incomplete. The system returns a hallucinated answer because the real answer was never retrieved in full.
2. Chunks too large — precision loss. A 2,000-token chunk about "dispute resolution policy" is retrieved for the query "foreign transaction dispute deadline". The chunk is relevant, but the specific deadline is one sentence in a 2,000-token sea of other policy text. The LLM receives far more noise than signal. With enough noise, it may generate an incorrect answer while appearing confident.
3. Chunks too small — context loss. A 64-token chunk contains the sentence "The deadline is 60 days." but no context for what the deadline applies to. The chunk is retrieved, but the LLM cannot determine if this refers to disputes, complaints, refunds, or something else entirely. Small chunks sacrifice interpretability.
The evaluation trap: chunking problems are silent. Your system returns answers that sound correct. The LLM is confident. Only a ground-truth evaluation against known question-answer pairs reveals that the right chunks are not being retrieved. Always evaluate chunk quality with RAGAS or a similar framework before deploying.
03 — Foundations

The fundamental size tradeoff

Every chunking decision is a negotiation between two competing properties: retrieval precision (small chunks match queries more exactly) and answer completeness (large chunks give the LLM enough context to reason correctly). You cannot fully maximise both simultaneously with a single chunk size.

Chunk size vs retrieval quality — the core tension

Chunk size | Precision | Completeness
64 tokens | Excellent (0.91) | Poor (0.22)
256 tokens | Good (0.78) | Moderate (0.65)
512 tokens | Moderate (0.65) | Good (0.81)
1024 tokens | Low (0.41) | High (0.88)

Illustrative values. Actual scores depend on document type, query style, and embedding model.

The industry consensus as of 2025 is that 256–512 tokens is the default sweet spot for general document Q&A. However, the best production systems do not use a single fixed size — they use hierarchical strategies like parent-child indexing (retrieve small, provide large) or RAPTOR (retrieve at the right level of abstraction) to resolve the tradeoff entirely.

Chunk size | Best for | Risk | Enterprise usage
64–128 tokens | Proposition-level facts, FAQ retrieval | Context loss, ambiguous pronouns | Niche — proposition-based RAG
256–512 tokens | General policy docs, knowledge bases | May split mid-argument | Default — most enterprise RAG deployments
512–1024 tokens | Legal, technical, narrative docs | Precision dilution | Common — legal/compliance RAG
Full section | Long-form synthesis tasks | Poor retrieval precision | Rare — parent tier only
04 — Strategy 1

Fixed-size chunking with overlap

The simplest strategy: split every document at a fixed character or token count, regardless of content. Because hard splits at arbitrary positions destroy sentence and paragraph continuity, you add overlap — repeating the last N tokens of a chunk at the start of the next.

Fixed-size chunking with 20% overlap, visualised — the overlap region is repeated at the start of the next chunk:

Chunk 1: "The customer must submit Form A within 60 days of the transaction date. Supporting documentation, such as bank statements, must be attached to the submission."
Chunk 2: "Supporting documentation, such as bank statements, must be attached to the submission. The Dispute Resolution Team will acknowledge receipt within two business days. Cases are typically resolved within 15–30 business days."
Chunk 3: "Cases are typically resolved within 15–30 business days. For disputes exceeding $10,000, additional approval from the Operations Director is required before a resolution can be issued."

Overlap prevents the most catastrophic split failures — the sentence about supporting documentation no longer sits isolated at the end of chunk 1, because the beginning of chunk 2 repeats it. This continuity makes it far more likely that retrieval returns a chunk carrying its surrounding context. The cost is storage: with 20% overlap each chunk advances only 80% of its length, so total chunk count (and vector DB storage) grows by roughly 25% (a factor of 1/0.8).

Recommended parameters

Typical enterprise configuration: chunk_size=512 tokens, chunk_overlap=50–100 tokens (approximately 10–20%). Larger overlap (up to 25%) is appropriate for dense technical documents where individual sentences are tightly interdependent. LangChain's RecursiveCharacterTextSplitter is the standard implementation.

Token vs character splitting: always split on tokens, not characters, when possible. A 512-character split produces wildly inconsistent token counts (Chinese text vs English text vs code). Token-based splitting produces predictable embedding input sizes and avoids truncation errors in embedding models.
Python · LangChain · Fixed-size chunking — production pattern
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("policy.pdf")
docs   = loader.load()  # returns list[Document] with page metadata

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # measure length in tokens, not characters
    chunk_size=512,                # tokens
    chunk_overlap=64,              # ~12.5% overlap
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in priority order
    add_start_index=True,          # adds char offset to metadata
)

chunks = splitter.split_documents(docs)
# Each chunk: Document(page_content="...", metadata={"source":..., "page":..., "start_index":...})
05 — Strategy 2

Recursive character splitting

The most widely used strategy in enterprise RAG as of 2025. Rather than splitting blindly at a character count, the recursive splitter tries a hierarchy of natural separators in priority order: double newline (paragraph break), single newline, sentence end, word boundary. It only falls through to the next separator if a chunk would still exceed the size limit after applying the current one.

The effect is that the splitter follows the document's own structure as closely as possible. A 400-token paragraph stays together. A 600-token paragraph is split at its nearest sentence boundary, not at character 512. The result is far more coherent chunks than fixed-size splitting, with almost no additional complexity.

The separator hierarchy

For plain text documents: "\n\n" → "\n" → ". " → " " → "". For Markdown: headings (##, #) are added at the front. For HTML: block elements (<p>, <div>) drive the hierarchy. For code: function and class boundaries take priority.

Default recommendation: recursive splitting with 512 tokens / 64 overlap is the correct default for any new RAG project. Only change strategy when you have evidence that a different approach improves RAGAS scores on your specific document corpus.
06 — Strategy 3

Semantic chunking

Semantic chunking replaces fixed boundaries with topic-change detection. Rather than splitting at character counts, it splits where the content actually changes topic — detected by measuring semantic similarity between adjacent sentences.

How it works

1. Sentence segmentation. Split the document into individual sentences using spaCy, NLTK, or a simple regex. Each sentence becomes a unit of measurement.
2. Embed each sentence. Run every sentence through an embedding model. This is done at ingestion time, so the cost is acceptable. The result is a sequence of vectors, one per sentence.
3. Compute adjacent similarity. For each pair of adjacent sentences, compute cosine similarity. Plot this as a similarity curve across the document. Topic continuations have high similarity (0.85+). Topic changes produce sharp drops (0.5 or below).
4. Split at breakpoints. Apply a threshold (typically mean − standard deviation of the similarity curve). Wherever similarity drops below the threshold, insert a chunk boundary. Merge adjacent sentences until the chunk is a sensible size.

The result is chunks that align with actual topical boundaries rather than arbitrary character counts. A section about "Form A submission requirements" becomes one chunk. A section about "resolution timelines" becomes another. The topic change, not a character count, defines the boundary.
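
The detection step itself is small enough to sketch from scratch. A minimal version, assuming sentences are already segmented and a hypothetical embed() helper that returns one vector per sentence:

Python · Sketch · Similarity-curve breakpoint detection
import numpy as np

def semantic_boundaries(sentences: list[str], embed) -> list[int]:
    """Return sentence indices where a new chunk should begin."""
    vecs = np.array([embed(s) for s in sentences])        # one vector per sentence
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # normalise for cosine
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)             # adjacent cosine similarity
    threshold = sims.mean() - sims.std()                  # mean − std breakpoint rule
    return [i + 1 for i, s in enumerate(sims) if s < threshold]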

Strengths and limitations

Property | Semantic chunking | Recursive splitting
Boundary quality | Topic-aligned | Paragraph-aligned
Ingestion cost | High (embed every sentence) | Near-zero
Chunk size consistency | Variable (some very large) | Bounded by config
Short documents | Works, but overkill | Ideal
Long, topic-rich documents | Excellent | Good
Enterprise adoption (2025) | Growing | Dominant
Python · LangChain · Semantic chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",   # or "standard_deviation", "gradient"
    breakpoint_threshold_amount=95.0,          # split at the lowest-similarity 5%; tune on your corpus
    min_chunk_size=100,                        # prevent tiny slivers
)

chunks = chunker.create_documents([text])
07 — Strategy 4

Structure-aware chunking

Structure-aware chunking uses the document's own organisational elements — headings, sections, HTML tags, Markdown hierarchy, XML elements — as chunk boundaries. The idea is that human authors already divided the document into meaningful units. The chunking strategy should respect those decisions rather than override them with character counts.

Heading-based chunking

For Markdown and HTML documents, every heading creates a potential chunk boundary. A chunk contains the heading and all content until the next heading at the same or higher level. LlamaIndex's MarkdownNodeParser and LangChain's MarkdownHeaderTextSplitter implement this natively.

A critical enhancement: store the full heading path in metadata. A chunk under "Section 4.2 — Foreign Disputes" should carry that path, not just the immediate heading. This path becomes a filter at retrieval time: "search only within Section 4 documents" is a valid metadata filter that dramatically reduces noise.
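
A minimal sketch with LangChain's MarkdownHeaderTextSplitter, which records the heading path in each chunk's metadata:

Python · LangChain · Heading-path metadata
from langchain.text_splitter import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(markdown_text)
# Each chunk's metadata carries its heading path, e.g.
# {"h1": "Section 4 — Foreign Disputes", "h2": "4.2 Dispute deadlines"}
# usable as a hard metadata filter at retrieval time.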

Unstructured.io for complex PDFs

Plain PDF extraction loses all structural information — headers, tables, and body text become an undifferentiated stream of characters. Unstructured.io uses machine learning to detect and classify layout elements (Title, NarrativeText, Table, ListItem, Image) before chunking. In production enterprise pipelines processing mixed-format documents, Unstructured.io is effectively the standard.

Pattern · Structure-aware with Unstructured.io + LangChain

The dominant enterprise ingestion pattern for complex document corpora in 2025.

Unstructured.io · Layout-aware extraction — detects titles, tables, figures, lists from PDF/DOCX/HTML
Element chunking · Each detected element type gets its own handling — tables preserved as JSON, figures captioned
Metadata injection · Element type, page number, section heading, bounding box coordinates stored per chunk
Recursive fallback · Narrative text elements that exceed the size limit fall back to recursive splitting
08 — Strategy 5

Agentic and proposition-based chunking

Introduced by Chen et al. (2023) in the Dense X Retrieval paper, proposition-based chunking uses an LLM at ingestion time to rewrite every paragraph into a set of atomic, self-contained factual statements — called propositions. Each proposition is a complete fact that can be understood without any surrounding context.

The goal of proposition-based chunking is to eliminate the ambiguity that plagues standard chunk boundaries. Every stored unit of text should be independently interpretable.

Dense X Retrieval, Chen et al. 2023

The transformation in practice

Original paragraph → propositions

Original: "It can be submitted by the customer or their representative. The form requires documentation of the charge in question, and it must be filed within 60 days."

→ "Form A can be submitted by the customer."
→ "Form A can be submitted by the customer's representative."
→ "Form A requires documentation of the charge in question."
→ "Form A must be filed within 60 days of the charge."

Notice what changed: the pronoun "it" in the original becomes the explicit subject "Form A" in every proposition. "Within 60 days" gains the context "of the charge" that was implied in the original. Each proposition is now unambiguous and self-contained. An embedding of proposition 4 sits extremely close to the query "dispute form deadline" in vector space — far closer than the original ambiguous paragraph would.

Cost and enterprise adoption

Proposition chunking is expensive — every paragraph requires an LLM call. For a 10,000-document corpus, this can mean 100,000+ LLM calls at ingestion time. This cost is why it remains a niche technique: it is used in high-stakes precision environments (financial compliance, medical documentation, legal research) where retrieval accuracy justifies the expense. Most enterprise RAG systems use recursive or structural chunking for the bulk of their corpus and apply proposition-based chunking selectively to the highest-value documents.

Python · OpenAI · Proposition extraction at ingestion
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def extract_propositions(paragraph: str) -> list[str]:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",   # a small model is sufficient for this task
        messages=[{
            "role": "user",
            "content": f"""Extract all factual propositions from the paragraph below.
Each proposition must be:
- A single, complete sentence
- Self-contained (no pronouns that refer to prior context)
- A factual statement, not a question or command

Paragraph:
{paragraph}

Return a JSON object of the form {{"propositions": ["...", "..."]}}."""
        }],
        response_format={"type": "json_object"}  # guarantees parseable JSON output
    )
    return json.loads(response.choices[0].message.content)["propositions"]
09 — Strategy 6

Late chunking

Late chunking, introduced by Jina AI in 2024, inverts the conventional order of operations. In standard RAG, you chunk first, then embed each chunk independently. In late chunking, you embed the entire document first, then chunk the resulting token embeddings.

Why this matters: the context loss problem

When you embed a chunk independently, the embedding model has no context beyond the chunk text. The sentence "The deadline is 60 days" embedded in isolation has a weaker, more ambiguous representation than the same sentence embedded with surrounding context ("This section covers foreign transaction dispute deadlines. The deadline is 60 days. Exceptions apply for fraud cases."). The embedding of the isolated chunk misses the contextual signal that surrounds it in the document.

Late chunking solves this by passing the full document through the embedding model, capturing cross-chunk contextual information in the attention layers. After the full forward pass, the resulting token-level embeddings are pooled by chunk boundary. Each chunk gets an embedding that was computed with awareness of the full document context.

Standard chunking vs late chunking — embedding order

STANDARD ORDER

1. Chunk document → chunks[0..N]
2. Embed chunk[0] independently
3. Embed chunk[1] independently
4. Store N isolated embeddings
Each chunk unaware of neighbours

LATE CHUNKING ORDER

1. Embed full document → token embeddings
2. Apply chunk boundaries to token stream
3. Pool token embeddings per boundary
4. Store N context-aware embeddings
Each chunk aware of full document

The limitation: the full document must fit in the embedding model's context window. Models like jina-embeddings-v3 (8,192 tokens) and voyage-3 (32,000 tokens) support longer contexts. Very long documents must still be pre-split into sections before late chunking is applied. Enterprise adoption is early — Jina AI's own benchmarks show 2–5% improvement in retrieval recall on long documents, which is meaningful at production scale.
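
The pooling step is straightforward to sketch. A minimal version, assuming a long-context embedding model served through Hugging Face transformers and chunk boundaries already expressed as token offsets:

Python · Sketch · Late chunking — pool token embeddings per boundary
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v3"   # any model exposing token-level hidden states
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model     = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(text: str, boundaries: list[tuple[int, int]]) -> list[torch.Tensor]:
    """One forward pass over the full document, then mean-pool per chunk span."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]   # (seq_len, dim)
    # each chunk embedding was computed with full-document attention context
    return [token_embs[start:end].mean(dim=0) for start, end in boundaries]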

10 — Strategy 7

Parent-child indexing

Parent-child indexing is the most widely adopted advanced chunking pattern in enterprise RAG as of 2025. It resolves the precision-versus-completeness tradeoff structurally, rather than trying to find a single ideal chunk size. The insight is that retrieval and generation have different requirements: retrieval wants small, precise chunks; generation wants large, contextually complete chunks.

The two-tier structure

Child chunks (retrieval tier)

Small chunks of 64–128 tokens. Embedded and stored in the vector database. Used for precise embedding matching at query time. Their small size means their vector is a sharp, focused representation of a narrow topic.

Parent chunks (generation tier)

Larger chunks of 512–1024 tokens. Stored in a document store (not necessarily a vector DB). When a child chunk is matched by retrieval, its parent is fetched and provided to the LLM. The LLM receives rich context.

The mechanics: each child chunk stores a reference to its parent's ID. At retrieval time, the system finds matching child chunks (precise, semantic match), then fetches their parent chunks (contextually complete). The LLM never sees the small child chunks — it sees the larger parents. Best of both worlds.

Python · LlamaIndex · Parent-child indexing — production pattern
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# Build multi-level node structure
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 256, 64]  # parent → child → grandchild
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# All nodes go in the docstore so parents can be fetched by ID;
# only leaf nodes are embedded into the vector index
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# AutoMergingRetriever: swaps children for their parent when enough match
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    simple_ratio_thresh=0.4  # if 40%+ of a parent's children match, return the parent
)

LangChain's ParentDocumentRetriever provides the same pattern with a different API. Both are widely used in enterprise deployments. The storage architecture typically combines a vector database (Weaviate, Qdrant, Pinecone) for child embeddings with a document store (Redis, MongoDB, or a simple in-memory store for smaller corpora) for parent chunks.
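
A minimal sketch of the LangChain variant, assuming a Chroma collection for the children and an in-memory store for the parents:

Python · LangChain · ParentDocumentRetriever (sketch)
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="children", embedding_function=OpenAIEmbeddings())

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,    # child embeddings live here
    docstore=InMemoryStore(),   # parent chunks live here
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=128),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1024),
)
retriever.add_documents(documents)  # splits, embeds children, stores parents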

11 — Advanced

RAPTOR hierarchical indexing

RAPTOR (Recursive Abstractive Processing for Tree-Organised Retrieval), published by Sarthi et al. at Stanford in 2024, extends the parent-child idea into a full tree structure built by recursive clustering and LLM summarisation. It specifically addresses queries that require synthesising information across many different document sections — a task where both flat chunk retrieval and parent-child indexing struggle.

Building the RAPTOR tree

1. Leaf layer — standard chunks. Begin with normal document chunks (256–512 tokens). These are the leaf nodes. They are embedded and stored as usual.
2. Clustering — group semantically similar chunks. Apply UMAP for dimensionality reduction, then Gaussian Mixture Models (GMM) for soft clustering. Each cluster groups chunks that discuss similar sub-topics. A document about banking policy might cluster into "dispute procedures", "account management", "fee schedules", and "regulatory compliance".
3. LLM summarisation — create parent nodes. For each cluster, use an LLM to write a summary of all chunks in the cluster. This summary becomes a new "parent" node in the tree. It is embedded and stored alongside the leaf chunks. (Steps 2–3 are sketched after this list.)
4. Recursion — build higher levels. Repeat the clustering and summarisation on the parent nodes. This produces grandparent nodes — summaries of summaries. Continue until a single root node (the document collection summary) is produced.
5. Retrieval — match at the right level. At query time, retrieve from all levels simultaneously. Broad thematic questions match high-level summaries. Specific factual questions match leaf chunks. The system returns the level that best answers the query.
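
Steps 2–3 for a single layer, as a minimal sketch assuming hypothetical embed() and summarise() helpers wrapping your embedding model and LLM:

Python · Sketch · One RAPTOR layer — cluster, then summarise
import numpy as np
import umap                                   # umap-learn package
from sklearn.mixture import GaussianMixture

def build_layer(chunks: list[str], n_clusters: int = 8) -> list[str]:
    """Cluster leaf chunks and summarise each cluster into a parent node."""
    vectors = np.array([embed(c) for c in chunks])       # assumed embedding helper
    reduced = umap.UMAP(n_components=10, metric="cosine").fit_transform(vectors)
    labels  = GaussianMixture(n_components=n_clusters).fit_predict(reduced)
    summaries = []
    for k in range(n_clusters):
        members = [c for c, lbl in zip(chunks, labels) if lbl == k]
        if members:
            summaries.append(summarise("\n\n".join(members)))  # assumed LLM helper
    return summaries  # embed and index these as parents, then recurse on them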
RAPTOR is expensive to build. A corpus of 10,000 documents requires clustering, LLM summarisation at every level, and re-embedding of all generated summaries. Expect 2–4× the ingestion cost of standard chunking. It is appropriate for large, stable corpora (legal archives, product documentation, research libraries) where the index is built once and queried many times.

Enterprise adoption as of 2025 is growing but selective. Teams typically use RAPTOR for their highest-value, highest-query-volume document collections — a bank's full regulatory document archive, a tech company's entire product documentation corpus — while using simpler strategies for lower-value data.

12 — Advanced

Multimodal chunking

Modern enterprise documents are not plain text. A single PDF may contain narrative text, tables, charts, diagrams, and photographs. Standard text splitters discard or mangle all non-text content. Multimodal chunking handles each content type appropriately.

Tables

Tables in PDFs are notoriously destructive when extracted naively — columns merge, rows split, and the tabular structure is lost entirely. Three approaches are used in production:

Text description

An LLM (GPT-4o, Claude) reads the table structure and generates a natural language description. Embeddable as text. Loses precision for large tables with many cells.

JSON serialisation

Convert table to JSON with row/column structure. Store JSON as text chunk alongside a brief text description. Enables both semantic and exact lookup (sketched after this list).

Structured DB storage

Extract table into a relational store. Use Text-to-SQL retrieval rather than vector retrieval for the table. Best for large, frequently queried numerical tables.
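
A minimal sketch of the JSON approach, assuming an Unstructured.io Table element whose extracted HTML is available in its metadata:

Python · Sketch · Table → JSON chunk
import json
import pandas as pd

def table_to_chunk(table) -> str:
    df = pd.read_html(table.metadata.text_as_html)[0]   # parse the extracted HTML table
    payload = {
        "description": f"Table with columns: {', '.join(map(str, df.columns))}",
        "rows": df.to_dict(orient="records"),           # exact cell values preserved
    }
    return json.dumps(payload)  # stored as a text chunk alongside the description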

Images and figures

Images require a vision LLM to generate searchable text. At ingestion time, every image is passed to a multimodal model (GPT-4o, Claude Sonnet, Gemini Pro Vision) with a prompt like "Describe this figure in detail for retrieval purposes. Include all text, labels, trends, and key findings." The generated description is stored as a text chunk with metadata linking it to the source image.

Audio and video

Audio is transcribed with Whisper or AWS Transcribe, then treated as a long text document. The transcript is chunked by speaker turn or time window (e.g. every 3 minutes of speech). Timestamps and speaker labels are stored as metadata, enabling retrieval of "what did speaker X say about topic Y at around 00:15:00".
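
A minimal sketch of the time-window variant, assuming Whisper-style segments with start/end timestamps and speaker labels already attached:

Python · Sketch · Transcript chunking by time window
def chunk_transcript(segments: list[dict], window_s: float = 180.0) -> list[dict]:
    """segments: [{"start": 0.0, "end": 4.2, "speaker": "A", "text": "..."}, ...]"""
    chunks, current, window_start = [], [], 0.0

    def flush():
        chunks.append({
            "text": " ".join(s["text"] for s in current),
            "start": window_start,                        # timestamp metadata
            "speakers": sorted({s["speaker"] for s in current}),
        })

    for seg in segments:
        if current and seg["end"] - window_start > window_s:
            flush()
            current, window_start = [], seg["start"]
        current.append(seg)
    if current:
        flush()
    return chunks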

13 — Advanced

Code-aware chunking

Code has a fundamentally different structure than prose. Splitting a Python file at 512 characters may split a function in the middle of its docstring, or between a class definition and its methods. Code-aware chunking uses the Abstract Syntax Tree (AST) to identify semantically meaningful boundaries.

The principle

The natural unit of code is the function or method. A code-aware chunker uses tree-sitter or the built-in ast module to parse the source file and extract each function, class, or module as a separate chunk. The chunk includes the docstring, signature, and body. Related functions can be grouped by class membership.

LangChain's Language.PYTHON splitter and LlamaIndex's CodeSplitter (backed by tree-sitter) both provide this. GitHub's Copilot RAG pipeline and enterprise code search tools (Sourcegraph Cody, AWS CodeWhisperer) all use AST-aware chunking internally.

Python · LangChain · Code-aware chunking with language-specific separators
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

# Language-specific separators respect AST boundaries
py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,   # functions can be large — allow more tokens
    chunk_overlap=0,    # function boundaries are hard stops — no overlap needed
)

# Separators used: ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]
chunks = py_splitter.create_documents([python_source_code])
14 — Enterprise Practice 2025

What enterprises actually use in 2025

Based on publicly documented deployments, OSS project adoption data, and engineering blog post evidence as of early 2025, the enterprise chunking landscape has converged on a tiered approach rather than any single strategy.

The teams with the best RAG performance are not using the most sophisticated chunking strategy. They are using recursive splitting with good metadata enrichment, and testing obsessively with RAGAS.

Common pattern across production RAG deployments
Tier 1 · Default strategy — 80% of enterprise RAG deployments

Recursive character splitting (512 tokens / 64 overlap) with Unstructured.io for complex PDFs. LLM-generated metadata enrichment (summary + keywords per chunk). Parent-child indexing for high-value document collections. Weaviate or Qdrant as the vector store.

Unstructured.io · PDF/DOCX/HTML layout extraction
RecursiveCharacterTextSplitter · LangChain — primary splitter
text-embedding-3-small / voyage-3 · Embedding model
Weaviate / Qdrant · Vector store with hybrid search
LLM metadata enrichment · gpt-4o-mini for summaries + keywords
ParentDocumentRetriever · LangChain parent-child pattern
AWS · Amazon Bedrock Knowledge Bases — managed chunking

Bedrock Knowledge Bases supports fixed, semantic, and hierarchical chunking (as of re:Invent 2024). Hierarchical chunking implements parent-child indexing natively — parent chunks stored in S3, child chunks in OpenSearch Serverless. Semantic chunking uses a small model to compute sentence-level similarity. The managed service abstracts infrastructure but limits customisation.

Fixed chunking · Default — configurable size + overlap
Semantic chunking · Sentence-similarity boundary detection
Hierarchical chunking · Native parent-child — parent in S3, child in OpenSearch
Titan Embeddings v2 · Default embedding — 1024 dim
Cohere Embed 3 · Alternative embedding option on Bedrock
Azure · Azure AI Search — integrated chunking pipeline

Azure AI Search's integrated vectorisation feature (GA in 2024) handles chunking, embedding, and indexing in a single skillset pipeline. The Document Intelligence layout skill extracts structure from PDFs. The Split skill implements recursive splitting. The AzureOpenAIEmbedding skill calls text-embedding-3-large. The entire pipeline runs serverless within Azure.

Document Intelligence · Layout extraction from PDFs/DOCX — equivalent to Unstructured.io
Text Split skill · Configurable size/overlap — recursive splitting
Azure OpenAI Embedding · text-embedding-3-large — 3072 dim
Hybrid search · BM25 + vector (HNSW) — native to AI Search
Semantic ranker · Microsoft cross-encoder reranker built in
GCP · Vertex AI Search — managed RAG with chunking

Vertex AI Search (formerly Enterprise Search) provides managed RAG with configurable chunking. Vertex AI RAG Engine (released 2024) gives fine-grained control over chunking strategies, embedding models, and vector stores. Supports 1,024-token default chunks with layout-aware PDF parsing.

Vertex AI RAG Engine · Full-pipeline RAG with custom chunk size + overlap
Document AI Layout Parser · Structure-aware PDF/DOCX extraction
text-embedding-005 · Latest Vertex embedding — 768 dim
Vector Search (ScaNN) · Google's ANNS index — billion-scale
15 — Engineering

Benchmarks and sizing guidelines

Chunk size recommendations cannot be universal — they depend on document type, query style, and embedding model. These are the empirically validated guidelines used across multiple production deployments as of 2025.

Document type | Recommended chunk size | Overlap | Strategy | Rationale
Corporate policy / procedure | 256–512 tokens | 10–15% | Recursive | Paragraph-level answers; headings preserve context
Legal contracts / compliance | 512–1024 tokens | 15–20% | Structural + recursive fallback | Clauses are interdependent; small chunks lose legal meaning
Technical documentation | 256–512 tokens | 10% | Structural (heading-based) | Section structure is meaningful; headings are queries
FAQ / support articles | 128–256 tokens | 0–10% | Fixed-size or proposition | Each Q&A is independent; tight precision needed
Research papers | 512 tokens | 15% | Section-aware + RAPTOR | Abstract/intro/conclusion need separate treatment
Source code | Function/class | 0 | AST-aware | Function boundaries are natural semantic units
Earnings calls / transcripts | 256 tokens | 20% | Speaker-turn aware | Speaker changes are semantic boundaries
Product catalogue | 1 product = 1 chunk | 0 | Entity-based | Product is the atomic retrieval unit
The 20-question test: before committing to a chunk strategy for a new corpus, build a small evaluation set of 20 representative questions with known correct answers. Ingest 100 documents with 3 different chunk sizes. Run RAGAS. The size with the highest context recall score is your starting point. This takes 2–3 hours and saves weeks of debugging production failures.
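
In outline, the test is a short loop. A minimal sketch, assuming hypothetical ingest() and run_ragas() helpers wrapping your pipeline and evaluation harness:

Python · Sketch · Chunk-size sweep
# ingest() builds an index at the given chunk size; run_ragas() scores a
# fixed question set against it. Both are assumed helpers, not library calls.
for size in (256, 512, 1024):
    index = ingest(sample_docs, chunk_size=size, chunk_overlap=size // 8)
    scores = run_ragas(index, eval_questions)
    print(f"{size} tokens → recall={scores['context_recall']:.2f} "
          f"precision={scores['context_precision']:.2f}")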
16 — Cloud Tooling

Cloud-native chunking tools

LangChain text splitters

LangChain provides the widest variety of text splitters of any framework. RecursiveCharacterTextSplitter is the default and handles 90% of use cases. Language-specific splitters support Python, JS, TypeScript, Markdown, HTML, LaTeX, and more. SemanticChunker (experimental) provides embeddings-based boundary detection.

Best for: teams that want maximum control and composability. LangChain splitters integrate directly into LCEL chains.

RecursiveCharacterTextSplitter — default, all doc types
CharacterTextSplitter — single separator only
TokenTextSplitter — tiktoken-based, exact token counts
MarkdownTextSplitter — heading-aware
HTMLHeaderTextSplitter — tag-based structure
SemanticChunker (experimental) — embeddings-based
ParentDocumentRetriever — parent-child indexing

LlamaIndex node parsers

LlamaIndex calls chunks "nodes" and provides parsers rather than splitters. The framework was built specifically for RAG, so its chunking primitives are more RAG-native than LangChain's. HierarchicalNodeParser natively builds parent-child trees. SentenceWindowNodeParser is a specialised small-window approach.

Best for: teams optimising specifically for RAG quality rather than general LLM pipelines. LlamaIndex provides more built-in evaluation tooling.

SentenceSplitter — sentence-boundary aware
SemanticSplitterNodeParser — cosine-similarity boundaries
HierarchicalNodeParser — multi-level parent-child
SentenceWindowNodeParser — tiny chunk + window context
MarkdownNodeParser — heading-hierarchy aware
CodeSplitter (tree-sitter) — AST-aware code
JSONNodeParser — preserves JSON structure

Unstructured.io

The enterprise standard for ingesting mixed-format document corpora. Uses ML-based layout detection to classify PDF elements before chunking. The hosted API (unstructured.io/api) and self-hosted Docker container are both used in production. The chunking_strategy="by_title" option is particularly powerful — it groups content under the same heading into a single chunk.

Best for: corpora with complex PDFs, mixed formats, scanned documents, or tables that matter.

chunking_strategy="basic" — simple size-based
chunking_strategy="by_title" — group by heading
partition_pdf() — layout-aware PDF extraction
partition_docx() — Word document extraction
Table detection + HTML serialisation
OCR via Tesseract for scanned PDFs
Hosted API + self-hosted Docker

Haystack (deepset)

Haystack 2.0 treats chunking as a pipeline component. The DocumentSplitter node is composable with other pipeline steps. Haystack has particularly strong support for hybrid retrieval pipelines that combine chunked vector search with BM25. It is the framework of choice for several European enterprise deployments.

Best for: teams building complete, production-grade pipelines where chunking is one component among many including custom preprocessing, evaluation, and deployment.

DocumentSplitter — core component
split_by="word" / "sentence" / "page" / "passage"
Pipeline-native — chains with converters, writers
Weaviate / Elasticsearch / Qdrant connectors
Hybrid retrieval (BM25 + vector) built in
Haystack Eval framework for chunk quality
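
A minimal sketch of DocumentSplitter inside a Haystack 2.x pipeline:

Python · Haystack · DocumentSplitter in a pipeline
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter

pipe = Pipeline()
pipe.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=10, split_overlap=1),
)

result = pipe.run({"splitter": {"documents": [Document(content=long_text)]}})
chunks = result["splitter"]["documents"]  # each carries source_id + split metadata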
17 — Code in practice

Production ingestion pipeline

A complete production chunking pipeline that handles mixed document types, runs enrichment, and writes to a vector store. This is the pattern used in enterprise deployments.

Python · Full pipeline · Enterprise chunking + enrichment + ingestion
import asyncio
from unstructured.partition.auto import partition
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import AsyncOpenAI
import weaviate

client     = AsyncOpenAI()
vector_db  = weaviate.connect_to_local()

# ── Step 1: Extract with Unstructured.io ──
def extract_document(path: str) -> list[str]:
    elements = partition(filename=path, strategy="hi_res")
    # group by section heading
    sections, current = [], []
    for el in elements:
        if el.category == "Title" and current:
            sections.append("\n".join(current)); current = []
        current.append(str(el))
    if current: sections.append("\n".join(current))
    return sections

# ── Step 2: Recursive split each section ──
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # token-based sizing, per the guidance above
    chunk_size=512, chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)

# ── Step 3: Async enrichment + embedding ──
async def enrich_chunk(text: str, metadata: dict) -> dict:
    summary_task = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role":"user","content":f"Summarise in 1 sentence: {text}"}]
    )
    embed_task = client.embeddings.create(
        input=text, model="text-embedding-3-small"
    )
    summary_resp, embed_resp = await asyncio.gather(summary_task, embed_task)
    return {
        "text": text,
        "vector": embed_resp.data[0].embedding,
        "summary": summary_resp.choices[0].message.content,
        **metadata
    }

# ── Step 4: Ingest into Weaviate (v4 client API) ──
async def ingest_file(path: str):
    collection = vector_db.collections.get("Chunk")  # assumes a "Chunk" collection exists
    sections = extract_document(path)
    for section in sections:
        raw_chunks = splitter.split_text(section)
        tasks = [
            enrich_chunk(c, {"source": path, "section": section[:80]})
            for c in raw_chunks
        ]
        enriched = await asyncio.gather(*tasks)
        # batch upsert: vector is passed separately from the properties
        with collection.batch.dynamic() as batch:
            for obj in enriched:
                vector = obj.pop("vector")
                batch.add_object(properties=obj, vector=vector)
18 — Evaluation

Evaluating chunk quality

You cannot judge chunk quality by looking at chunks. You judge it by measuring retrieval quality on a set of questions with known correct answers. The RAGAS framework is the standard tool. The four metrics most directly influenced by chunking are:

Metric | What it measures | Chunking influence
Context recall | Were all necessary chunks retrieved? (requires ground truth) | Very high — wrong chunk size = missed retrieval
Context precision | Of retrieved chunks, how many were relevant? | High — overly large chunks reduce precision
Faithfulness | Are LLM claims grounded in retrieved context? | Moderate — complete context enables faithful answers
Answer relevancy | Does the answer address the question? | Moderate — downstream of retrieval quality
The silent failure: a RAG system can have context recall of 0.4 (missing the right chunks 60% of the time) and still generate fluent, confident answers. The LLM hallucinates from partial context. The only way to detect this is ground-truth evaluation. Never ship a RAG system without a RAGAS evaluation run on a representative question set.
Python · RAGAS · Chunk quality evaluation
from ragas import evaluate
from ragas.metrics import context_recall, context_precision, faithfulness

# eval_dataset: HuggingFace Dataset with columns:
# question, answer, contexts (list[str]), ground_truth
results = evaluate(
    dataset=eval_dataset,
    metrics=[context_recall, context_precision, faithfulness],
    llm=evaluation_llm,
    embeddings=evaluation_embeddings,
)

# Interpret results:
# context_recall < 0.7  → chunks too small, splitting at wrong boundary
# context_precision < 0.6 → chunks too large, too much noise retrieved
# faithfulness < 0.8    → LLM not grounded, may indicate context quality issue
print(results)