
Parallelisation in Agentic AI

Ingestion & Retrieval — fan-out at independence points, merge at dependency points

Level 1 — document-level parallelism (coarsest grain)
Kafka topic: ingest-jobs — 10,000 documents as independent messages
partitions: 0 · 1 · 2 · 3 — each partition consumed by a separate worker group
Transport
↓ fan out
Worker pool — N workers process N documents simultaneously
Worker 1
doc A: load → chunk
Celery / BullMQ
Worker 2
doc B: load → chunk
Celery / BullMQ
Worker 3
doc C: load → chunk
Celery / BullMQ
Worker N
doc N: load → chunk
Celery / BullMQ
Level 2 — pipeline parallelism (different docs at different stages)
Doc A: embedding
stage 3 of pipeline
Ray worker
Doc B: chunking
stage 2 of pipeline
Ray worker
Doc C: loading
stage 1 of pipeline
Ray worker
Doc D: writing
stage 4 of pipeline
Ray worker
Level 3 — intra-document page parallelism (Ray actors)
200-page PDF
single document, fan to 4 actors
Ray remote
Actor: pages 1–50
extract + chunk in parallel
Ray actor
Actor: pages 51–100
extract + chunk in parallel
Ray actor
Actor: pages 101–150
extract + chunk in parallel
Ray actor
Actor: pages 151–200
extract + chunk in parallel
Ray actor
↓ merge chunks → embed
Level 4 — batch embedding (GPU utilisation)
1,024 chunks → single GPU forward pass → 1,024 vectors out
GPU utilisation: ~5% (one chunk per pass) vs ~98% (batch of 1,024) · a batched pass takes roughly the same wall-clock time as a single-chunk pass
GPU batch
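
A minimal sketch of the batching pattern, assuming a local model loaded through sentence-transformers; the model name and batch size here are illustrative, not a recommendation:

    from sentence_transformers import SentenceTransformer

    # Illustrative model; swap in whatever embedding model the pipeline actually uses.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_chunks(chunks: list[str]) -> list[list[float]]:
        # One call, batched internally: chunks are packed into full GPU batches
        # instead of one forward pass per chunk.
        return model.encode(chunks, batch_size=256, show_progress_bar=False).tolist()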
Level 5 — async I/O (non-blocking writes)
Vector DB upsert
asyncio — non-blocking
Weaviate / Qdrant
S3 raw doc write
asyncio — non-blocking
Object store
Postgres metadata
asyncio — non-blocking
asyncpg
Graph DB upsert
asyncio — non-blocking
Neo4j / Neptune
↓ all writes complete (asyncio.gather)
Sequential barriers — these cannot be parallelised
chunk → embed
must chunk before embedding; order enforced within one doc
Hard dependency
embed → write
vector must exist before DB upsert
Hard dependency
load → validate
file must be read before schema check
Hard dependency
Query arrives — transform before search
Query transformation — rewrite + decompose (sequential, ~50ms)
LLM rewrites query into 3–5 sub-questions and/or expanded phrasings before the parallel fan-out
Pre-retrieval
↓ asyncio.gather() — all fire simultaneously
Parallel fan-out — all search types run concurrently
Dense search
vector cosine similarity · HNSW index · ~20–40ms
Semantic
BM25 sparse search
keyword match · inverted index · ~5–15ms
Keyword
Sub-query 1
dense search on decomposed question
Multi-query
Sub-query 2
dense search on alternate phrasing
Multi-query
Graph traversal
entity lookup · Cypher · ~10–30ms
Knowledge graph
↓ all complete (total = slowest, not sum)
Merge — Reciprocal Rank Fusion
RRF merge — score = 1/(60 + rank) summed across all result lists
deduplicate · score · sort · top-20 candidates · total latency = max(all searches) not sum(all searches)
Fusion
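
A minimal sketch of the fusion step. The constant 60 and the top-20 cut-off come from the description above; the function name and the use of string doc ids are illustrative:

    from collections import defaultdict

    def rrf_merge(result_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
        """Fuse ranked lists of doc ids: score = sum of 1 / (k + rank) across lists."""
        scores: dict[str, float] = defaultdict(float)
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        # Deduplication falls out of keying by doc_id; sort by fused score, keep top_n.
        return sorted(scores, key=scores.get, reverse=True)[:top_n]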
Shard-level parallelism inside vector DB (transparent to app)
Shard 0
local HNSW search → local top-K
DB internal
Shard 1
local HNSW search → local top-K
DB internal
Shard 2
local HNSW search → local top-K
DB internal
Coordinator
merges local top-K lists → global top-K
DB internal
Reranking — batch scoring (cross-encoder)
Cross-encoder reranker: 20 candidates → single batch forward pass → top-5
Cohere rerank-english-v3.0 · bge-reranker-large · scores query + doc together in one GPU pass · ~80–150ms
Post-retrieval
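
A minimal sketch of batch scoring, assuming the reranker is loaded as a sentence-transformers CrossEncoder; the checkpoint name is one of the examples above, everything else is a placeholder:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("BAAI/bge-reranker-large")  # or any cross-encoder checkpoint

    def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
        # Score every (query, candidate) pair in one batched forward pass,
        # rather than one pass per candidate.
        scores = reranker.predict([(query, c) for c in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [c for c, _ in ranked[:top_k]]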
LLM generation — grounded in retrieved context
LLM generates answer — top-5 chunks injected into prompt context
retrieval wall-clock: ~40–80ms (parallel) vs ~200–400ms (sequential) — 40–60% latency reduction
Generation
Hard sequential dependencies — cannot be parallelised
Chunk before embed
embedding model requires complete chunk text
Sequential
Embed before write
vector must exist before DB upsert
Sequential
Transform before retrieve
rewritten queries must exist before fan-out
Sequential
GPU memory bottleneck
Single GPU: one forward pass at a time — all CPU parallelism queues here
fix: larger GPU (bigger VRAM = bigger batches) · multiple GPUs · hosted embedding API (provider handles GPU parallelism) · model quantisation
GPU bound
Vector DB write throughput ceiling
HNSW index rebuild is not infinitely parallelisable — write rate ceiling exists
fix: write buffering (accumulate vectors in Redis, flush in controlled batches) · segment merging (Qdrant / Weaviate) · async indexing (index after acknowledge)
Write bound
Embedding API rate limits
OpenAI / Cohere rate limits: tokens-per-minute ceiling on hosted embedding APIs
fix: exponential backoff with jitter · Redis rate-limit queue (workers pull at controlled rate) · local embedding model (no rate limit) · API tier upgrade
Rate limit
Reranker latency in retrieval
Cross-encoder reranker: sequential over candidates (not parallelised per candidate)
fix: batch scoring (all candidates in one GPU pass) · reduce candidate pool (retrieve fewer) · use faster reranker (MiniLM vs large cross-encoder) · cache common query reranks
Latency bound
Lost-in-the-middle — long context side effect
Parallel retrieval returns many chunks — LLM ignores content in the middle of a long context
fix: place most-relevant chunks at position 0 and last position · least relevant go in the middle · rerank before placement · use contextual compression to shrink chunk size
LLM behaviour
Python — intra-document Ray parallelism
Ray remote: fan 200 pages across N actors simultaneously
wall-clock ≈ the slowest actor's batch, not the sum of all pages · ray.get() blocks until all complete
Ray / Python
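
A minimal sketch of the fan-out / merge shape, shown with stateless remote tasks rather than actor classes; extract_and_chunk is a placeholder for the real page-extraction logic:

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def extract_and_chunk(pdf_path: str, start: int, end: int) -> list[str]:
        # Placeholder: load pages start..end, extract text, return chunks.
        return [f"chunk from pages {start}-{end}"]

    def ingest_pdf(pdf_path: str, total_pages: int = 200, n_workers: int = 4) -> list[str]:
        step = total_pages // n_workers
        ranges = [(i * step + 1, (i + 1) * step) for i in range(n_workers)]
        # Fan out: one remote task per page range, all running at the same time.
        futures = [extract_and_chunk.remote(pdf_path, s, e) for s, e in ranges]
        # Merge: ray.get blocks until every task has finished.
        return [c for chunks in ray.get(futures) for c in chunks]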
Python — async I/O parallel writes
asyncio.gather: vector DB + S3 + Postgres + graph all write simultaneously
100s of writes in flight at once · no blocking · total time = slowest single write, not sum
asyncio / Python
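
A minimal sketch of the write fan-out; the four write functions are stand-ins for the real async clients (vector DB, S3, asyncpg, graph driver), and only the gather pattern is the point:

    import asyncio

    async def upsert_vectors(chunks):   # e.g. vector DB async client
        await asyncio.sleep(0.05)       # stand-in for the real network write

    async def write_raw_doc(doc):       # e.g. S3 object write
        await asyncio.sleep(0.05)

    async def write_metadata(meta):     # e.g. asyncpg INSERT
        await asyncio.sleep(0.05)

    async def upsert_graph(entities):   # e.g. graph DB async driver
        await asyncio.sleep(0.05)

    async def persist(doc, chunks, meta, entities):
        # All four writes are in flight at once; total time ≈ the slowest write.
        await asyncio.gather(
            upsert_vectors(chunks),
            write_raw_doc(doc),
            write_metadata(meta),
            upsert_graph(entities),
        )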
Python — parallel hybrid retrieval
asyncio.gather: dense + sparse + sub-queries + graph all fire concurrently
40–80ms total vs 200–400ms sequential · RRF merge after all complete
asyncio / Python
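
A minimal sketch of the retrieval fan-out, reusing the rrf_merge helper sketched earlier; the three search functions are placeholders for the real vector, BM25 and graph clients:

    import asyncio

    async def dense_search(query: str) -> list[str]:
        return ["doc-a", "doc-b"]        # placeholder for the vector DB call

    async def bm25_search(query: str) -> list[str]:
        return ["doc-b", "doc-c"]        # placeholder for the keyword index call

    async def graph_search(query: str) -> list[str]:
        return ["doc-d"]                 # placeholder for the graph traversal

    async def hybrid_retrieve(query: str, sub_queries: list[str]) -> list[str]:
        # Every search fires at once; total latency ≈ the slowest search, not the sum.
        result_lists = await asyncio.gather(
            dense_search(query),
            bm25_search(query),
            graph_search(query),
            *(dense_search(sq) for sq in sub_queries),
        )
        return rrf_merge(list(result_lists))  # fuse, dedupe, keep the top candidates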
Node.js — BullMQ parallel worker pool
BullMQ concurrency setting: N workers process N jobs from the queue simultaneously
priority queue · retry with backoff · job deduplication via Redis · dead letter queue for failures
BullMQ / Node.js
Python — Celery canvas chain (pipeline parallelism)
Celery group + chord: fan out tasks across workers, callback fires when all complete
group = run N tasks in parallel · chord = group + callback · chain = sequential steps within one doc
Celery / Python
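
A minimal sketch of the canvas shapes; the task bodies and the broker URL are placeholders:

    from celery import Celery, chain, chord, group

    app = Celery("ingest", broker="redis://localhost:6379/0")  # placeholder broker

    @app.task
    def load(doc_id):
        return doc_id                    # read the raw document

    @app.task
    def chunk(doc_id):
        return doc_id                    # split into chunks

    @app.task
    def embed_and_write(doc_id):
        return doc_id                    # embed chunks and upsert

    @app.task
    def report(results):
        return f"{len(results)} documents ingested"

    def ingest(doc_ids):
        # chain = sequential stages within one doc · group = all docs in parallel
        # chord = group plus a callback that fires once every chain has finished
        per_doc = [chain(load.s(d), chunk.s(), embed_and_write.s()) for d in doc_ids]
        return chord(group(per_doc))(report.s())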
Redis — rate-limit queue (API ceiling control)
Redis token bucket: workers consume tokens, refill at API rate limit — prevents 429 errors
exponential backoff with jitter on failure · Redis INCR + EXPIRE for sliding window · keeps worker pool fully busy up to (not over) the ceiling
Redis / Python or Node.js
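
A minimal sketch of the INCR + EXPIRE idea, simplified to a fixed window rather than a true token bucket; the key name, limit and backoff values are illustrative:

    import random
    import time

    import redis

    r = redis.Redis()  # placeholder connection

    def acquire(key: str = "embed-api", limit: int = 500, window_s: int = 60) -> None:
        """Block until this worker may call the API without exceeding
        `limit` requests per `window_s` seconds."""
        while True:
            bucket = f"{key}:{int(time.time()) // window_s}"
            count = r.incr(bucket)
            if count == 1:
                r.expire(bucket, window_s * 2)   # old buckets expire on their own
            if count <= limit:
                return                           # under the ceiling: proceed
            # Over the ceiling: back off with jitter, then retry in a later window.
            time.sleep(random.uniform(0.5, 2.0))

Each worker calls acquire() immediately before each API request, so the pool stays busy right up to the ceiling without tripping 429s.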

Component detail

explanation, examples & design rationale

Fan-out at independence points

Parallelisation is only possible where there is no dependency between tasks. The core skill is identifying which tasks have no edges between them in the work graph — those are the fan-out points. Everything else is either a sequential step or a merge.

The two latency equations

Sequential: total = sum of all steps. Parallel: total = slowest single step. If dense search takes 40ms and BM25 takes 10ms and you run them sequentially, you wait 50ms. In parallel you wait 40ms. Across thousands of queries and every stage of the pipeline, that difference compounds.

Ingestion vs retrieval grain

Ingestion parallelism is about throughput — processing thousands of documents as fast as possible, with hours or days available. Retrieval parallelism is about latency — shaving milliseconds from a user-facing query that must complete in under 200ms. The techniques overlap, but they optimise in opposite directions.

The GPU is always the bottleneck

No amount of CPU-level parallelism overcomes a GPU bottleneck. All workers queue at the embedding model. The fix is always one of: a bigger batch size, more GPUs, or offloading to a hosted API that handles GPU parallelism on the provider's side.

asyncio vs multiprocessing

asyncio is for I/O-bound work: waiting for HTTP responses, database writes, S3 uploads, anything where the process is idle waiting on the network. Multiprocessing / Ray is for CPU-bound work: PDF parsing, text processing, chunking logic. Using asyncio for CPU-bound tasks does not help: a coroutine doing heavy computation blocks the single-threaded event loop, and the GIL stops threads from picking up the slack.
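
A minimal sketch of how the two combine, with placeholder functions: CPU-bound chunking runs in a process pool, I/O-bound writes stay on the event loop:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def chunk_document(raw_text: str) -> list[str]:
        # CPU-bound: parsing / chunking, runs in a separate process
        return [raw_text[i:i + 500] for i in range(0, len(raw_text), 500)]

    async def upsert(chunks: list[str]) -> None:
        # I/O-bound: the process is just waiting on the network here
        await asyncio.sleep(0.05)  # stand-in for a vector DB write

    async def ingest(raw_text: str, pool: ProcessPoolExecutor) -> None:
        loop = asyncio.get_running_loop()
        # Hand the CPU work to another process so the event loop never blocks,
        chunks = await loop.run_in_executor(pool, chunk_document, raw_text)
        # then use asyncio for the non-blocking write.
        await upsert(chunks)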