
Parallelisation in Agentic AI

Ingestion & Retrieval — fan-out at independence points, merge at dependency points

Level 1 — document-level parallelism (coarsest grain)
Kafka topic: ingest-jobs — 10,000 documents as independent messages
partitions: 0 · 1 · 2 · 3 — each partition consumed by a separate worker group
Transport
↓ fan out
Worker pool — N workers process N documents simultaneously
Worker 1
doc A: load → chunk
Celery / BullMQ
Worker 2
doc B: load → chunk
Celery / BullMQ
Worker 3
doc C: load → chunk
Celery / BullMQ
Worker N
doc N: load → chunk
Celery / BullMQ
Level 2 — pipeline parallelism (different docs at different stages)
Doc A: embedding
stage 3 of pipeline
Ray worker
Doc B: chunking
stage 2 of pipeline
Ray worker
Doc C: loading
stage 1 of pipeline
Ray worker
Doc D: writing
stage 4 of pipeline
Ray worker
Level 3 — intra-document page parallelism (Ray actors)
200-page PDF
single document, fan to 4 actors
Ray remote
Actor: pages 1–50
extract + chunk in parallel
Ray actor
Actor: pages 51–100
extract + chunk in parallel
Ray actor
Actor: pages 101–150
extract + chunk in parallel
Ray actor
Actor: pages 151–200
extract + chunk in parallel
Ray actor
↓ merge chunks → embed
Level 4 — batch embedding (GPU utilisation)
1,024 chunks → single GPU forward pass → 1,024 vectors out
GPU utilisation: ~5% (one chunk per pass) vs ~98% (batch of 1,024) · a batched pass takes roughly the same wall-clock time as a single-chunk pass
GPU batch
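
A minimal sketch of the batching pattern, assuming a local model loaded through sentence-transformers; the model name and batch size here are illustrative, not a recommendation:

    from sentence_transformers import SentenceTransformer

    # Illustrative model; swap in whatever embedding model the pipeline actually uses.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_chunks(chunks: list[str]) -> list[list[float]]:
        # One call, batched internally: chunks are packed into full GPU batches
        # instead of one forward pass per chunk.
        return model.encode(chunks, batch_size=256, show_progress_bar=False).tolist()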
Level 5 — async I/O (non-blocking writes)
Vector DB upsert
asyncio — non-blocking
Weaviate / Qdrant
S3 raw doc write
asyncio — non-blocking
Object store
Postgres metadata
asyncio — non-blocking
asyncpg
Graph DB upsert
asyncio — non-blocking
Neo4j / Neptune
↓ all writes complete (asyncio.gather)
Sequential barriers — these cannot be parallelised
chunk → embed
must chunk before embedding; order enforced within one doc
Hard dependency
embed → write
vector must exist before DB upsert
Hard dependency
load → validate
file must be read before schema check
Hard dependency
Query arrives — transform before search
Query transformation — rewrite + decompose (sequential, ~50ms)
LLM rewrites query into 3–5 sub-questions and/or expanded phrasings before the parallel fan-out
Pre-retrieval
↓ asyncio.gather() — all fire simultaneously
Parallel fan-out — all search types run concurrently
Dense search
vector cosine similarity · HNSW index · ~20–40ms
Semantic
BM25 sparse search
keyword match · inverted index · ~5–15ms
Keyword
Sub-query 1
dense search on decomposed question
Multi-query
Sub-query 2
dense search on alternate phrasing
Multi-query
Graph traversal
entity lookup · Cypher · ~10–30ms
Knowledge graph
↓ all complete (total = slowest, not sum)
Merge — Reciprocal Rank Fusion
RRF merge — score = 1/(60 + rank) summed across all result lists
deduplicate · score · sort · top-20 candidates · total latency = max(all searches) not sum(all searches)
Fusion
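
A minimal sketch of the fusion step. The constant 60 and the top-20 cut-off come from the description above; the function name and the use of string doc ids are illustrative:

    from collections import defaultdict

    def rrf_merge(result_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
        """Fuse ranked lists of doc ids: score = sum of 1 / (k + rank) across lists."""
        scores: dict[str, float] = defaultdict(float)
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        # Deduplication falls out of keying by doc_id; sort by fused score, keep top_n.
        return sorted(scores, key=scores.get, reverse=True)[:top_n]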
Shard-level parallelism inside vector DB (transparent to app)
Shard 0
local HNSW search → local top-K
DB internal
Shard 1
local HNSW search → local top-K
DB internal
Shard 2
local HNSW search → local top-K
DB internal
Coordinator
merges local top-K lists → global top-K
DB internal
Reranking — batch scoring (cross-encoder)
Cross-encoder reranker: 20 candidates → single batch forward pass → top-5
Cohere rerank-english-v3.0 · bge-reranker-large · scores query + doc together in one GPU pass · ~80–150ms
Post-retrieval
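
A minimal sketch of batch scoring, assuming the reranker is loaded as a sentence-transformers CrossEncoder; the checkpoint name is one of the examples above, everything else is a placeholder:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("BAAI/bge-reranker-large")  # or any cross-encoder checkpoint

    def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
        # Score every (query, candidate) pair in one batched forward pass,
        # rather than one pass per candidate.
        scores = reranker.predict([(query, c) for c in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [c for c, _ in ranked[:top_k]]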
LLM generation — grounded in retrieved context
LLM generates answer — top-5 chunks injected into prompt context
retrieval wall-clock: ~40–80ms (parallel) vs ~200–400ms (sequential) — 40–60% latency reduction
Generation
Hard sequential dependencies — cannot be parallelised
Chunk before embed
embedding model requires complete chunk text
Sequential
Embed before write
vector must exist before DB upsert
Sequential
Transform before retrieve
rewritten queries must exist before fan-out
Sequential
GPU memory bottleneck
Single GPU: one forward pass at a time — all CPU parallelism queues here
fix: larger GPU (bigger VRAM = bigger batches) · multiple GPUs · hosted embedding API (provider handles GPU parallelism) · model quantisation
GPU bound
Vector DB write throughput ceiling
HNSW index rebuild is not infinitely parallelisable — write rate ceiling exists
fix: write buffering (accumulate vectors in Redis, flush in controlled batches) · segment merging (Qdrant / Weaviate) · async indexing (index after acknowledge)
Write bound
Embedding API rate limits
OpenAI / Cohere rate limits: tokens-per-minute ceiling on hosted embedding APIs
fix: exponential backoff with jitter · Redis rate-limit queue (workers pull at controlled rate) · local embedding model (no rate limit) · API tier upgrade
Rate limit
Reranker latency in retrieval
Cross-encoder reranker: sequential over candidates (not parallelised per candidate)
fix: batch scoring (all candidates in one GPU pass) · reduce candidate pool (retrieve fewer) · use faster reranker (MiniLM vs large cross-encoder) · cache common query reranks
Latency bound
Lost-in-the-middle — long context side effect
Parallel retrieval returns many chunks — LLM ignores content in the middle of a long context
fix: place most-relevant chunks at position 0 and last position · least relevant go in the middle · rerank before placement · use contextual compression to shrink chunk size
LLM behaviour
Python — intra-document Ray parallelism
Ray remote: fan 200 pages across N actors simultaneously
wall-clock ≈ the slowest actor's batch, not the sum of all pages · ray.get() blocks until all complete
Ray / Python
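
A minimal sketch of the fan-out / merge shape, shown with stateless remote tasks rather than actor classes; extract_and_chunk is a placeholder for the real page-extraction logic:

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def extract_and_chunk(pdf_path: str, start: int, end: int) -> list[str]:
        # Placeholder: load pages start..end, extract text, return chunks.
        return [f"chunk from pages {start}-{end}"]

    def ingest_pdf(pdf_path: str, total_pages: int = 200, n_workers: int = 4) -> list[str]:
        step = total_pages // n_workers
        ranges = [(i * step + 1, (i + 1) * step) for i in range(n_workers)]
        # Fan out: one remote task per page range, all running at the same time.
        futures = [extract_and_chunk.remote(pdf_path, s, e) for s, e in ranges]
        # Merge: ray.get blocks until every task has finished.
        return [c for chunks in ray.get(futures) for c in chunks]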
Python — async I/O parallel writes
asyncio.gather: vector DB + S3 + Postgres + graph all write simultaneously
100s of writes in flight at once · no blocking · total time = slowest single write, not sum
asyncio / Python
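
A minimal sketch of the write fan-out; the four write functions are stand-ins for the real async clients (vector DB, S3, asyncpg, graph driver), and only the gather pattern is the point:

    import asyncio

    async def upsert_vectors(chunks):   # e.g. vector DB async client
        await asyncio.sleep(0.05)       # stand-in for the real network write

    async def write_raw_doc(doc):       # e.g. S3 object write
        await asyncio.sleep(0.05)

    async def write_metadata(meta):     # e.g. asyncpg INSERT
        await asyncio.sleep(0.05)

    async def upsert_graph(entities):   # e.g. graph DB async driver
        await asyncio.sleep(0.05)

    async def persist(doc, chunks, meta, entities):
        # All four writes are in flight at once; total time ≈ the slowest write.
        await asyncio.gather(
            upsert_vectors(chunks),
            write_raw_doc(doc),
            write_metadata(meta),
            upsert_graph(entities),
        )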
Python — parallel hybrid retrieval
asyncio.gather: dense + sparse + sub-queries + graph all fire concurrently
40–80ms total vs 200–400ms sequential · RRF merge after all complete
asyncio / Python
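
A minimal sketch of the retrieval fan-out, reusing the rrf_merge helper sketched earlier; the three search functions are placeholders for the real vector, BM25 and graph clients:

    import asyncio

    async def dense_search(query: str) -> list[str]:
        return ["doc-a", "doc-b"]        # placeholder for the vector DB call

    async def bm25_search(query: str) -> list[str]:
        return ["doc-b", "doc-c"]        # placeholder for the keyword index call

    async def graph_search(query: str) -> list[str]:
        return ["doc-d"]                 # placeholder for the graph traversal

    async def hybrid_retrieve(query: str, sub_queries: list[str]) -> list[str]:
        # Every search fires at once; total latency ≈ the slowest search, not the sum.
        result_lists = await asyncio.gather(
            dense_search(query),
            bm25_search(query),
            graph_search(query),
            *(dense_search(sq) for sq in sub_queries),
        )
        return rrf_merge(list(result_lists))  # fuse, dedupe, keep the top candidates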
Node.js — BullMQ parallel worker pool
BullMQ concurrency setting: N workers process N jobs from the queue simultaneously
priority queue · retry with backoff · job deduplication via Redis · dead letter queue for failures
BullMQ / Node.js
Python — Celery canvas chain (pipeline parallelism)
Celery group + chord: fan out tasks across workers, callback fires when all complete
group = run N tasks in parallel · chord = group + callback · chain = sequential steps within one doc
Celery / Python
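
A minimal sketch of the canvas shapes; the task bodies and the broker URL are placeholders:

    from celery import Celery, chain, chord, group

    app = Celery("ingest", broker="redis://localhost:6379/0")  # placeholder broker

    @app.task
    def load(doc_id):
        return doc_id                    # read the raw document

    @app.task
    def chunk(doc_id):
        return doc_id                    # split into chunks

    @app.task
    def embed_and_write(doc_id):
        return doc_id                    # embed chunks and upsert

    @app.task
    def report(results):
        return f"{len(results)} documents ingested"

    def ingest(doc_ids):
        # chain = sequential stages within one doc · group = all docs in parallel
        # chord = group plus a callback that fires once every chain has finished
        per_doc = [chain(load.s(d), chunk.s(), embed_and_write.s()) for d in doc_ids]
        return chord(group(per_doc))(report.s())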
Redis — rate-limit queue (API ceiling control)
Redis token bucket: workers consume tokens, refill at API rate limit — prevents 429 errors
exponential backoff with jitter on failure · Redis INCR + EXPIRE for sliding window · keeps worker pool fully busy up to (not over) the ceiling
Redis / Python or Node.js
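
A minimal sketch of the INCR + EXPIRE idea, simplified to a fixed window rather than a true token bucket; the key name, limit and backoff values are illustrative:

    import random
    import time

    import redis

    r = redis.Redis()  # placeholder connection

    def acquire(key: str = "embed-api", limit: int = 500, window_s: int = 60) -> None:
        """Block until this worker may call the API without exceeding
        `limit` requests per `window_s` seconds."""
        while True:
            bucket = f"{key}:{int(time.time()) // window_s}"
            count = r.incr(bucket)
            if count == 1:
                r.expire(bucket, window_s * 2)   # old buckets expire on their own
            if count <= limit:
                return                           # under the ceiling: proceed
            # Over the ceiling: back off with jitter, then retry in a later window.
            time.sleep(random.uniform(0.5, 2.0))

Each worker calls acquire() immediately before each API request, so the pool stays busy right up to the ceiling without tripping 429s.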

Component detail

explanation, examples & design rationale

Fan-out at independence points

Parallelisation is only possible where there is no dependency between tasks. The core skill is identifying which tasks have no edges between them in the work graph — those are the fan-out points. Everything else is either a sequential step or a merge.

The two latency equations

Sequential: total = sum of all steps. Parallel: total = slowest single step. If dense search takes 40ms and BM25 takes 10ms and you run them sequentially, you wait 50ms. In parallel you wait 40ms. Across thousands of queries and every stage of the pipeline, that difference compounds.

Ingestion vs retrieval grain

Ingestion parallelism is about throughput — processing thousands of documents as fast as possible, with hours or days available. Retrieval parallelism is about latency — shaving milliseconds from a user-facing query that must complete in under 200ms. The techniques overlap, but they optimise in opposite directions.

The GPU is always the bottleneck

No amount of CPU-level parallelism overcomes a GPU bottleneck. All workers queue at the embedding model. The fix is always one of: a bigger batch size, more GPUs, or offloading to a hosted API that handles GPU parallelism on the provider's side.

asyncio vs multiprocessing

asyncio is for I/O-bound work: waiting for HTTP responses, database writes, S3 uploads, anything where the process is idle waiting on the network. Multiprocessing / Ray is for CPU-bound work: PDF parsing, text processing, chunking logic. Using asyncio for CPU-bound tasks does not help: a coroutine doing heavy computation blocks the single-threaded event loop, and the GIL stops threads from picking up the slack.
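
A minimal sketch of how the two combine, with placeholder functions: CPU-bound chunking runs in a process pool, I/O-bound writes stay on the event loop:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def chunk_document(raw_text: str) -> list[str]:
        # CPU-bound: parsing / chunking, runs in a separate process
        return [raw_text[i:i + 500] for i in range(0, len(raw_text), 500)]

    async def upsert(chunks: list[str]) -> None:
        # I/O-bound: the process is just waiting on the network here
        await asyncio.sleep(0.05)  # stand-in for a vector DB write

    async def ingest(raw_text: str, pool: ProcessPoolExecutor) -> None:
        loop = asyncio.get_running_loop()
        # Hand the CPU work to another process so the event loop never blocks,
        chunks = await loop.run_in_executor(pool, chunk_document, raw_text)
        # then use asyncio for the non-blocking write.
        await upsert(chunks)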