Parallelisation in Agentic Systems
How to design, build, and deploy agent workloads that execute concurrently across distributed, containerised infrastructure — and why it fundamentally changes how you think about LLM latency.
What is parallelisation, precisely?
Parallelisation is the practice of decomposing a larger unit of work into smaller, independent sub-units that execute simultaneously rather than sequentially. The result is that total wall-clock time approaches the duration of the slowest single unit, rather than the sum of all units.
This sounds simple. The engineering consequences are profound.
Sequential execution is a tax on latency. Every step that can execute concurrently but does not is borrowed time you are paying interest on at every request.
Systems design principle: In classical computing, parallelisation operates at two levels: data parallelism (same operation, multiple data chunks) and task parallelism (different operations running concurrently). Agentic systems inherit both and add a third: model parallelism — splitting inference itself across compute units.
The latency mathematics
Consider a task with five steps, each taking 2 seconds. Sequential execution costs 10 seconds. If three of those steps are independent, parallelising them brings the critical path to 6 seconds — a 40% reduction with zero algorithmic improvement.
In agent systems, individual steps (LLM calls, RAG retrievals, API tool calls) routinely take 200ms to 3 seconds each. When an agent needs to call six tools to answer a question, the difference between sequential and parallel execution is the difference between a 12-second response and a 3-second response — with the same quality output.
Illustrative timeline: SEQUENTIAL — total: 13 seconds. PARALLEL (fan-out steps 1–4, then sequential 5–6) — total: 5 seconds.
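The same effect is easy to see in code. A minimal sketch, assuming an illustrative call_tool coroutine that stands in for any slow I/O-bound step such as an LLM or API call:

```python
import asyncio
import time

async def call_tool(name: str, seconds: float) -> str:
    """Stand-in for an LLM or tool call — pure I/O wait."""
    await asyncio.sleep(seconds)
    return f"{name} done"

async def sequential() -> float:
    start = time.perf_counter()
    for name in ("search", "database", "weather", "crm"):
        await call_tool(name, 2.0)          # each call blocks the next
    return time.perf_counter() - start      # ~8 seconds

async def parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(                   # all four calls in flight at once
        *[call_tool(name, 2.0) for name in ("search", "database", "weather", "crm")]
    )
    return time.perf_counter() - start      # ~2 seconds — the slowest single call

print(asyncio.run(sequential()), asyncio.run(parallel()))
```

The parallel version's wall-clock time is the duration of the slowest call, exactly as the latency mathematics above predicts.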
Why agents demand parallelisation
Traditional software systems make deterministic, fast, local function calls. An LLM call is none of these things: it is probabilistic, slow (200ms–10s), and remote. This changes the economics of sequential execution completely.
An agentic system by definition orchestrates multiple calls: planning calls, tool execution calls, validation calls, synthesis calls. A ReAct-pattern agent making six sequential LLM calls at 1.5 seconds each produces a 9-second response. That is catastrophic for user experience and uneconomical at scale.
The six forcing functions
Latency pressure
LLM inference is fundamentally slow. The only way to achieve sub-second agentic responses is to minimise the serial chain of LLM calls on the critical path.
Throughput demand
Production agentic systems handle thousands of concurrent users. Sequential per-request processing creates a throughput ceiling that cannot be broken without parallelism.
Task decomposition
Most complex agent tasks are naturally decomposable — summarise these 50 documents, check these 10 data sources, evaluate these 5 answers. Decomposition without parallelism misses the entire point.
Model diversity
Agents route to different models (fast/cheap for classification, capable for reasoning). Running these concurrently against the same input extracts the best of each.
Speculative gains
Launching multiple possible next steps simultaneously and using the first successful result — discarding the rest — trades compute cost for latency reduction.
Tool independence
Most tool calls within an agent step are data-independent. Calling a weather API, a database, and a search index simultaneously versus sequentially is pure parallelism gain.
The bottlenecks parallelisation solves
Before designing parallelisation, you must correctly identify what is actually slow. In agent systems, the bottlenecks are predictable and measurable.
| Bottleneck | Typical cost | Parallelisable? | Strategy |
|---|---|---|---|
| LLM inference | 500ms–10s per call | YES | Fan-out independent calls; model routing |
| RAG retrieval | 100–500ms per query | YES | Parallel queries across vector stores |
| Tool API calls | 50ms–2s per call | YES | asyncio.gather / Promise.all concurrent execution |
| Document processing | Seconds–minutes per doc | YES | Map-reduce across worker pool |
| Embedding generation | 50–200ms per batch | YES | Batch API + parallel batch submission |
| Agent planning (serial) | 1–3s per plan step | LIMITED | Speculative execution; plan caching |
| Final synthesis | 1–5s | NO | Reduce critical path feeding into it |
| State persistence | 1–10ms (Redis) | PIPELINE | Write-behind, async persistence |
Fan-out / fan-in
Fan-out / fan-in is the most fundamental pattern in agentic parallelisation. A single input is distributed ("fanned out") to multiple concurrent workers, each processes independently, and results are collected ("fanned in") by an aggregator.
In agent systems, fan-out typically happens at the tool execution layer: the planner identifies N independent actions, fires all N simultaneously, and the orchestrator waits for all (or a quorum) to complete before proceeding.
Tool call fan-out
When an agent needs to call multiple tools to gather information for a single response, all independent calls are fired simultaneously. The agent awaits the last one to complete.
When to use: When tool results are independent — one result does not depend on another's output. Accounts for roughly 70% of real-world agent fan-out usage.
Aggregation: Results are merged into a unified context that the final LLM call synthesises. Result ordering is deterministic via keyed futures, not arrival order.
Failure handling: Individual tool failures should not abort the fan-out. Partial results with explicit nulls outperform a full retry in most cases.
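A minimal sketch of that aggregation and failure policy, with illustrative tool coroutines (get_weather, query_database, web_search are placeholders): results are keyed by tool name so ordering is deterministic, and a failed branch becomes an explicit null rather than aborting the fan-out.

```python
import asyncio

async def fan_out_tools(query: str) -> dict:
    """Fire all independent tool calls at once; failures become explicit nulls."""
    tools = {
        "weather": get_weather(query),      # illustrative tool coroutines
        "database": query_database(query),
        "search": web_search(query),
    }
    results = await asyncio.gather(*tools.values(), return_exceptions=True)
    # Key results by tool name — deterministic ordering regardless of arrival order
    return {
        name: (None if isinstance(result, Exception) else result)
        for name, result in zip(tools.keys(), results)
    }
```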
Document fan-out
A corpus of documents is distributed across a worker pool. Each worker independently processes its assigned documents — summarising, extracting, classifying. Results are collected and reduced.
When to use: Bulk document analysis, knowledge base ingestion, audit pipelines. Achieves near-linear throughput scaling with worker count, up to the LLM provider's rate limit.
Chunk size matters: Documents too large for a single context window must be chunked before fan-out. Chunk boundaries must be semantically coherent (sentence boundaries, paragraph breaks) not arbitrary byte offsets.
Rate limit awareness: Workers must share a token-bucket rate limiter. Naive parallelism exceeds the LLM provider's tokens-per-minute quota and causes cascading 429 errors.
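One way to produce semantically coherent chunks before the fan-out is to split on paragraph breaks and pack paragraphs up to a size budget. A minimal sketch (the 8,000-character budget is illustrative):

```python
def chunk_document(text: str, max_chars: int = 8000) -> list[str]:
    """Split on paragraph breaks and pack paragraphs into chunks under the budget."""
    chunks, current = [], ""
    for para in text.split("\n\n"):          # paragraph boundaries, not byte offsets
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```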
Model fan-out (ensemble)
The same prompt is sent simultaneously to multiple LLM models. Responses are collected and a judge/aggregator selects the best or synthesises a consensus answer.
When to use: High-stakes decisions, fact-checking, content moderation, tasks where model variance matters. Especially valuable for classification tasks where a 3-model majority vote reduces error rates significantly.
Trade-off: Costs 2-3x more in tokens. Justified when accuracy improvement materially changes downstream decisions.
Judge pattern: A lightweight, fast model (Gemini Flash, Haiku) acts as judge — it does not regenerate but selects/scores among the candidate responses. Keep the judge cheap.
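A sketch of the ensemble-plus-judge shape, reusing the AsyncAnthropic client from the other examples; the candidate model list and the judging prompt are illustrative assumptions, not a fixed recipe.

```python
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
CANDIDATE_MODELS = ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"]  # illustrative
JUDGE_MODEL = "claude-haiku-4-5-20251001"                              # keep the judge cheap

async def ask(model: str, prompt: str) -> str:
    resp = await client.messages.create(
        model=model, max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

async def ensemble_answer(prompt: str) -> str:
    # Fan-out: the same prompt goes to every candidate model concurrently
    candidates = await asyncio.gather(*[ask(m, prompt) for m in CANDIDATE_MODELS])
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    # Judge: selects/scores among candidates, never regenerates the answer itself
    verdict = await ask(
        JUDGE_MODEL,
        f"Question:\n{prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "Reply with only the number of the best answer.",
    )
    try:
        return candidates[int(verdict.strip().strip("[]."))]
    except (ValueError, IndexError):
        return candidates[0]  # fall back if the judge reply is unparseable
```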
Multi-source search fan-out
A query is decomposed into multiple sub-queries or sent simultaneously to multiple knowledge sources (vector store, BM25 index, SQL, external API, web search). Results are ranked and merged.
When to use: RAG over heterogeneous knowledge sources. Enterprise agents that must query internal databases, document stores, and external APIs in a single retrieval step.
Result fusion: Reciprocal Rank Fusion (RRF) is the standard algorithm for merging ranked lists from multiple retrievers without access to individual relevance scores.
Timeout strategy: Each search has an independent timeout. A slow external API should not block results from the fast vector store. Return partial results on timeout.
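Reciprocal Rank Fusion itself is only a few lines. The sketch below merges ranked lists of document IDs from each retriever, using the conventional k = 60 smoothing constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-ID lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([vector_ids, bm25_ids, sql_ids])
```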
Map-reduce for agents
Map-reduce adapts the classic big-data pattern for LLM workloads. A large task is mapped across independent workers (each processing a subset), and results are reduced into a final output. In agent systems, both the map and reduce steps involve LLM calls.
Hierarchical reduction
When the number of map outputs exceeds the reduce model's context window, use hierarchical (tree) reduction: reduce groups of N outputs, then reduce those group summaries, and so on until a single output remains. This is the LLM equivalent of a classic parallel tree reduction.
```python
# Hierarchical map-reduce for large document corpora
import asyncio
from typing import List

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

CHUNK_SIZE = 8000    # tokens per chunk
REDUCE_GROUP = 10    # max summaries per reduce call


async def map_chunk(chunk: str, task: str) -> str:
    """Process a single chunk — the map step."""
    resp = await client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast + cheap for map
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nDocument chunk:\n{chunk}\n\nProvide a structured intermediate result."
        }]
    )
    return resp.content[0].text


async def reduce_batch(intermediates: List[str], task: str, final: bool = False) -> str:
    """Reduce a batch of intermediate results."""
    model = "claude-sonnet-4-6" if final else "claude-haiku-4-5-20251001"
    combined = "\n\n---\n\n".join(intermediates)
    resp = await client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nIntermediate results to synthesise:\n{combined}"
        }]
    )
    return resp.content[0].text


async def hierarchical_map_reduce(documents: List[str], task: str) -> str:
    # Phase 1: MAP — all chunks processed in parallel
    semaphore = asyncio.Semaphore(20)  # respect rate limits

    async def bounded_map(chunk):
        async with semaphore:
            return await map_chunk(chunk, task)

    intermediates = await asyncio.gather(
        *[bounded_map(doc) for doc in documents]
    )

    # Phase 2: HIERARCHICAL REDUCE — tree reduction
    current_level = list(intermediates)
    while len(current_level) > 1:
        groups = [
            current_level[i:i + REDUCE_GROUP]
            for i in range(0, len(current_level), REDUCE_GROUP)
        ]
        is_final = len(groups) == 1
        current_level = await asyncio.gather(
            *[reduce_batch(g, task, final=is_final) for g in groups]
        )

    return current_level[0]
```
Pipeline parallelism
Pipeline parallelism overlaps sequential stages. Rather than waiting for stage A to finish all its work before stage B begins, each stage starts processing output as soon as stage A produces it. This is streaming applied to agent orchestration.
The classic example in agent systems is streaming token generation into downstream processing. As the LLM streams tokens for step N, a parser begins extracting structured data. By the time the LLM finishes, the extracted data is already ready for step N+1.
```python
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

# Helper coroutines (parse_partial_json, dispatch_tool, embed_query, warmup_retriever,
# retrieve_contexts, build_prompt_template) are assumed to be defined elsewhere.


async def stream_and_extract(prompt: str) -> dict:
    """Pipeline: stream LLM response while concurrently extracting structure."""
    buffer = []
    extracted = {}
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for chunk in stream.text_stream:
            buffer.append(chunk)
            current_text = "".join(buffer)
            # Pipeline stage: extract JSON as it streams, don't wait for completion
            if "action" in current_text and not extracted.get("action"):
                extracted["action"] = parse_partial_json(current_text, "action")
                # Fire tool call immediately — don't wait for stream to end
                asyncio.create_task(dispatch_tool(extracted["action"]))
    return extracted


async def pipeline_rag_agent(query: str) -> dict:
    # Stage 1+2 OVERLAP: embed query while pre-warming retrieval connection
    embed_task = asyncio.create_task(embed_query(query))
    warmup_task = asyncio.create_task(warmup_retriever())
    embedding, _ = await asyncio.gather(embed_task, warmup_task)

    # Stage 2+3 OVERLAP: retrieve while assembling prompt template
    retrieve_task = asyncio.create_task(retrieve_contexts(embedding))
    template_task = asyncio.create_task(build_prompt_template(query))
    contexts, template = await asyncio.gather(retrieve_task, template_task)

    # Stage 3: LLM streams — extraction is pipelined mid-stream
    prompt = template.format(contexts=contexts)
    result = await stream_and_extract(prompt)
    return result
```
Speculative execution
Speculative execution launches work before knowing if it will be needed, discards unused results, and uses the winner of whichever branch resolves first. This trades compute cost for latency — justified when the value of a fast response exceeds the marginal cost of wasted computation.
Speculative decoding at the model level
At the LLM inference level, a small draft model (e.g. Haiku) generates candidate tokens speculatively. The large target model (e.g. Sonnet) verifies multiple tokens in parallel per forward pass. When the draft matches, the large model effectively processes multiple tokens per step. This is how providers like Anthropic and Google achieve significantly higher throughput without changing the target model.
Speculative branching at the orchestration level
At the agent orchestration level, the planner launches multiple next-step branches simultaneously before it knows which one is correct. The first branch to return a valid result is used; others are cancelled.
```python
import asyncio


async def speculative_route(query: str) -> str:
    """Launch fast and slow paths simultaneously; return first valid result."""

    async def fast_path() -> str:
        # Try cache first — returns instantly or raises on a miss
        result = await semantic_cache_lookup(query)
        if not result:
            raise ValueError("cache miss")
        return result

    async def medium_path() -> str:
        # Try RAG + small model — ~800ms
        context = await retrieve_and_summarise(query)
        return await haiku_generate(query, context)

    async def full_path() -> str:
        # Full agentic pipeline with tool calls — 3-6s
        return await full_agent_pipeline(query)

    # Stagger starts: fast fires immediately, medium after 50ms, full after 200ms
    # This avoids burning tokens on slow paths if the fast path will win
    tasks = {
        asyncio.create_task(fast_path()): "fast",
        asyncio.create_task(_delayed(medium_path, 0.05)): "medium",
        asyncio.create_task(_delayed(full_path, 0.20)): "full",
    }

    pending = set(tasks.keys())
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if not task.exception():
                # Cancel all remaining tasks — we have our answer
                for p in pending:
                    p.cancel()
                return task.result()

    raise RuntimeError("All speculative paths failed")


async def _delayed(coro_fn, delay: float):
    await asyncio.sleep(delay)
    return await coro_fn()
```
The actor model
The actor model structures agents as independent units ("actors") that communicate exclusively via message passing, hold private internal state, and can create child actors. No shared memory. No direct method calls. This maps elegantly to agentic systems where agents must be independently scalable, failure-isolated, and geographically distributed.
Frameworks like Microsoft AutoGen and agent implementations on Ray implement actor semantics for LLM-based agents. Each agent actor has a mailbox (message queue), a behaviour function (the LLM call + tool execution logic), and an address (durable identity in the distributed system).
| Actor property | Agent mapping | Cloud primitive |
|---|---|---|
| Mailbox | Task queue for incoming agent instructions | SQS FIFO / Service Bus / Pub/Sub |
| Private state | Agent memory (context, tool history) | Redis key per agent ID / Cosmos item |
| Address | Durable agent session ID | UUID + registration in Consul / etcd |
| Child actors | Sub-agent spawning | New task/pod creation + queue registration |
| Supervision | Orchestrator restarts failed agents | Kubernetes restartPolicy / Step Functions retry |
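As a concrete illustration of this mapping, a minimal sketch using Ray actors; the ResearchAgent class and its handle method are illustrative, and a production version would wrap the LLM call and tool loop inside handle.

```python
import ray

ray.init()

@ray.remote
class ResearchAgent:
    """Each instance is an independent actor: private state, message passing only."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.history = []          # private state — no other actor can touch it

    def handle(self, message: str) -> str:
        # Behaviour function: in a real agent this wraps the LLM call + tool loop
        self.history.append(message)
        return f"{self.agent_id} processed: {message}"

# Spawn three agent actors and message them concurrently (calls return futures)
agents = [ResearchAgent.remote(f"agent-{i}") for i in range(3)]
futures = [a.handle.remote("summarise Q3 incidents") for a in agents]
print(ray.get(futures))            # fan-in on the actor replies
```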
Async-first design
The single most impactful decision in agentic system design is committing to async-first architecture. Every LLM call, tool invocation, database read, and network request should be non-blocking by default. Synchronous calls inside an async runtime are a hidden bottleneck that prevents all parallelism patterns from working correctly.
The event loop model
Python's asyncio and Node.js's event loop provide cooperative multi-tasking on a single thread: while one coroutine awaits a network response, the event loop runs other ready coroutines. Go's goroutines achieve the same effect with lightweight, runtime-scheduled threads. Either way, you get thousands of concurrent "in-flight" operations with negligible per-operation overhead.
```python
# THE FOUR CORE ASYNC PARALLELISATION PRIMITIVES

# 1. asyncio.gather — N independent coroutines, all results needed
results = await asyncio.gather(
    llm_call(prompt_a),
    retrieve_context(query),
    call_tool("database", params),
    return_exceptions=True  # don't abort on single failure
)

# 2. asyncio.wait FIRST_COMPLETED — speculative, race to first valid result
done, pending = await asyncio.wait(
    [task_a, task_b, task_c],
    return_when=asyncio.FIRST_COMPLETED
)
winner = done.pop().result()
for p in pending:
    p.cancel()

# 3. asyncio.Semaphore — bounded concurrency (respect rate limits!)
sem = asyncio.Semaphore(20)  # max 20 concurrent LLM calls

async def bounded_call(prompt):
    async with sem:
        return await llm_call(prompt)

results = await asyncio.gather(*[bounded_call(p) for p in prompts])

# 4. asyncio.Queue — producer/consumer pipeline between stages
NUM_CONSUMERS = 10
queue: asyncio.Queue = asyncio.Queue(maxsize=100)

async def producer():
    async for item in document_stream():
        await queue.put(item)       # blocks if queue is full (backpressure)
    for _ in range(NUM_CONSUMERS):
        await queue.put(None)       # one sentinel per consumer so every consumer exits

async def consumer():
    while True:
        item = await queue.get()
        if item is None:
            break
        await process(item)

# Run producer and N consumers concurrently
await asyncio.gather(
    producer(),
    *[consumer() for _ in range(NUM_CONSUMERS)]
)
```
State management under parallelism
Parallelism introduces the hardest class of bugs in software engineering: race conditions, lost updates, and inconsistent reads. In agent systems, these manifest as agents reading stale context, tool results overwriting each other, or duplicate task execution.
The golden rule
Agent state must be partitioned by agent ID and never shared across concurrent agents without explicit synchronisation. The simplest pattern: each agent owns its own Redis key namespace (agent:{session_id}:*). No agent reads or writes another agent's namespace without explicit coordination via a message.
| State type | Storage | Access pattern | Concurrency control |
|---|---|---|---|
| Session context | Redis (TTL 30min) | Read-heavy, write on turn | Optimistic locking (WATCH/MULTI) |
| Tool results | Redis / object store | Write once, read many | Immutable keys (write once) |
| Task status | Redis / Postgres | Single writer per task | CAS (compare-and-swap) |
| Agent memory | Vector DB + Redis | Read-heavy retrieval | Eventual consistency acceptable |
| Audit log | Append-only Postgres / S3 | Write-only from agents | Append-only (no updates) |
| Global config | Redis / etcd | Read-heavy, rare writes | Read replicas + cache |
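The session-context row above calls for optimistic locking with WATCH/MULTI. A minimal sketch of the retry loop with redis-py's asyncio client, using the agent:{session_id}:* key convention from the golden rule; the append-a-turn update itself is illustrative.

```python
import redis.asyncio as redis

r = redis.Redis()

async def update_session_context(session_id: str, new_turn: str) -> None:
    """Optimistically append a turn: retry if another writer touched the key first."""
    key = f"agent:{session_id}:context"
    async with r.pipeline(transaction=True) as pipe:
        while True:
            try:
                await pipe.watch(key)                # start optimistic lock on the key
                current = await pipe.get(key) or b""
                updated = current + new_turn.encode()
                pipe.multi()                         # switch to buffered transaction
                pipe.set(key, updated, ex=1800)      # 30-minute TTL, as in the table
                await pipe.execute()                 # raises WatchError on conflict
                return
            except redis.WatchError:
                continue                             # another writer got there first — retry
```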
Idempotency is not optional
In distributed parallel systems, messages are delivered at least once. Tool calls will be retried. Agent steps will be re-executed on failure. Every agent action must be designed as an idempotent operation: executing it twice must produce the same result as executing it once. Use idempotency keys on all write operations.
LangGraph parallel node execution
LangGraph is the dominant orchestration framework for production agentic systems. It models agent workflows as typed state graphs where nodes are async functions and edges define execution order. Parallel branches are expressed natively by the graph topology.
```python
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated, List
import operator


class AgentState(TypedDict):
    query: str
    db_result: str
    rag_result: str
    api_result: str
    # Annotated with operator.add means results are MERGED not overwritten
    tool_outputs: Annotated[List[str], operator.add]
    final_answer: str


# Define parallel nodes — each is an independent async function
async def query_database(state: AgentState) -> dict:
    result = await db_lookup(state["query"])
    return {"db_result": result, "tool_outputs": [result]}


async def retrieve_context(state: AgentState) -> dict:
    result = await vector_search(state["query"])
    return {"rag_result": result, "tool_outputs": [result]}


async def call_external_api(state: AgentState) -> dict:
    result = await api_fetch(state["query"])
    return {"api_result": result, "tool_outputs": [result]}


async def synthesise(state: AgentState) -> dict:
    # This node only runs after all parallel branches complete (fan-in)
    combined = "\n".join(state["tool_outputs"])
    answer = await llm_synthesise(state["query"], combined)
    return {"final_answer": answer}


# BUILD THE GRAPH — topology encodes parallelism
graph = StateGraph(AgentState)
graph.add_node("db", query_database)
graph.add_node("rag", retrieve_context)
graph.add_node("api", call_external_api)
graph.add_node("synth", synthesise)

# Fan-out from START to three parallel branches — LangGraph runs them in the same superstep
graph.add_edge(START, "db")
graph.add_edge(START, "rag")
graph.add_edge(START, "api")

# Fan-in: all branches must complete before synth runs
graph.add_edge("db", "synth")
graph.add_edge("rag", "synth")
graph.add_edge("api", "synth")
graph.add_edge("synth", END)

app = graph.compile()
result = await app.ainvoke({"query": "What is the customer's current balance?"})
```
Containerised architecture for parallel agents
Containers are the natural unit of deployment for parallel agent workloads. Each agent type runs in its own container image: identical runtime environment, predictable resource envelope, independently scalable, independently deployable. Kubernetes orchestrates them at scale.
Distributed systems design
A parallel agent system at production scale is a distributed system. All of the classical distributed systems problems apply: network partitions, message ordering, consistency vs availability trade-offs, and clock skew. These are not theoretical concerns — they manifest in production agent bugs.
Exactly-once semantics for agent tasks
Agent tool calls must not execute twice. Sending an email, writing a database record, or charging a payment must be idempotent with exactly-once guarantees. The standard pattern: assign a unique idempotency key to every task before it enters the queue. Workers check the key against a deduplication store (Redis SET or database unique constraint) before executing.
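A minimal sketch of that dedup-before-execute check using a Redis SET NX claim; the key format and the side-effecting execute_tool call are illustrative assumptions.

```python
import redis.asyncio as redis

r = redis.Redis()

async def run_task_once(task_id: str, payload: dict):
    """Execute a task only if its idempotency key has never been claimed."""
    # SET NX atomically claims the key; returns falsy if another worker already did
    claimed = await r.set(f"dedup:{task_id}", "in_progress", nx=True, ex=86400)
    if not claimed:
        return None                            # duplicate delivery — skip execution
    result = await execute_tool(payload)       # the side-effecting call (illustrative)
    await r.set(f"dedup:{task_id}", "done", ex=86400)
    return result
```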
Distributed rate limiting
Forty parallel agents each making LLM calls simultaneously will exceed your provider's tokens-per-minute quota within seconds. A centralised rate limiter is mandatory. The standard approach is a token bucket implemented in Redis, using either INCR/EXPIRE, the redis-cell module, or an atomic Lua script as below.
```python
import asyncio
import time

import redis.asyncio as redis


class TokenBucketLimiter:
    """Distributed token bucket — shared across all agent pods via Redis."""

    def __init__(self, r: redis.Redis, key: str, capacity: int, refill_rate: float):
        self.r = r
        self.key = key                   # e.g. "rate:claude:tokens"
        self.capacity = capacity         # max tokens (e.g. 100_000 TPM)
        self.refill_rate = refill_rate   # tokens per second

    async def acquire(self, tokens: int) -> bool:
        """Atomically acquire tokens. Returns True if granted, False if throttled."""
        now = time.time()
        # Lua script runs atomically in Redis — no race conditions
        script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local rate = tonumber(ARGV[2])
        local tokens = tonumber(ARGV[3])
        local now = tonumber(ARGV[4])
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local current = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now
        local elapsed = now - last_refill
        local refilled = math.min(capacity, current + elapsed * rate)
        if refilled >= tokens then
            redis.call('HMSET', key, 'tokens', refilled - tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 1
        end
        return 0
        """
        result = await self.r.eval(
            script, 1, self.key, self.capacity, self.refill_rate, tokens, now
        )
        return bool(result)

    async def acquire_or_wait(self, tokens: int, max_wait: float = 30.0):
        deadline = time.time() + max_wait
        while time.time() < deadline:
            if await self.acquire(tokens):
                return
            await asyncio.sleep(0.1)
        raise TimeoutError(f"Rate limit: could not acquire {tokens} tokens in {max_wait}s")
```
AWS parallelisation deployment
AWS provides the most mature serverless primitives for agentic parallelisation. The combination of Kinesis enhanced fan-out, Lambda provisioned concurrency, and Step Functions Express workflows is an industry-leading pattern for high-throughput agent pipelines.
Step Functions Map state — fan-out pattern
```json
{
  "ProcessDocuments": {
    "Type": "Map",
    "Comment": "Fan out up to 40 parallel Lambda invocations; tolerate 10% item failures",
    "ItemsPath": "$.documents",
    "MaxConcurrency": 40,
    "ToleratedFailurePercentage": 10,
    "Iterator": {
      "StartAt": "AnalyseChunk",
      "States": {
        "AnalyseChunk": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:ap-southeast-2:*:function:agent-map-worker",
          "Retry": [{
            "ErrorEquals": ["ThrottlingException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 5,
            "BackoffRate": 2.0,
            "JitterStrategy": "FULL"
          }],
          "End": true
        }
      }
    },
    "Next": "ReduceResults"
  }
}
```
Azure parallelisation deployment
Azure's standout for agentic parallelisation is Durable Functions — it provides stateful fan-out/fan-in orchestration as a first-class programming model with human-in-the-loop checkpoints and event-driven scaling via KEDA.
Durable Functions fan-out/fan-in
```python
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    documents = context.get_input()

    # Fan-out: fire all activity functions in parallel
    parallel_tasks = [
        context.call_activity("AnalyseDocument", doc)
        for doc in documents
    ]

    # Fan-in: wait for ALL to complete (or use context.task_any for first-wins)
    results = yield context.task_all(parallel_tasks)

    # Human-in-the-loop: pause until external approval event
    approval = yield context.wait_for_external_event("ApprovalReceived")
    if approval["approved"]:
        yield context.call_activity("PublishResults", results)

    return results


main = df.Orchestrator.create(orchestrator_function)
```
GCP parallelisation deployment
GCP's strengths for agentic parallelisation are Pub/Sub's near-unbounded horizontal scale and Vertex AI Reasoning Engine's managed agent runtime, which together make it the strongest platform for agents that must scale to unpredictable, very high concurrency with minimal operational overhead.
Dataflow parallel agent pipeline
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class AgentAnalysisDoFn(beam.DoFn):
    """Parallel per-element agent execution — runs on each Dataflow worker."""

    def setup(self):
        # Called once per worker — initialise client here, not per element
        from anthropic import Anthropic
        self.client = Anthropic()

    def process(self, element):
        response = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[{"role": "user", "content": element["text"]}]
        )
        yield {
            "id": element["id"],
            "analysis": response.content[0].text
        }


options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=australia-southeast1",
    "--max_num_workers=100",  # up to 100 parallel workers
])

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/*/topics/docs")
     | "Analyse" >> beam.ParDo(AgentAnalysisDoFn())   # massively parallel
     | "Write" >> beam.io.WriteToBigQuery("project.dataset.results")
    )
```
Observability for parallel agent systems
Debugging a sequential system is hard. Debugging a parallel agent system without proper observability is nearly impossible — race conditions, partial failures, and unexpected orderings are invisible without distributed tracing.
The three mandatory signals
Distributed traces
Every agent task carries a trace context (trace-id, span-id) propagated through all message headers and LLM API calls. A single user request produces one trace that spans all parallel branches. OpenTelemetry is the standard.
Fan-out metrics
Track branch count per fan-out, p50/p99/max branch duration, and fan-in wait time (how long the aggregator waits for the slowest branch). The slowest branch is always your bottleneck — identify it.
Queue depth + lag
Consumer lag on the task queue is the single most important operational metric. Lag growing means your agent workers cannot keep up. Lag flat means you are stable. Lag shrinking means you are over-provisioned.
The critical path problem
In parallel execution, the slowest branch determines total latency. Identifying which branch is the critical path requires per-branch timing in your traces. Once identified, optimise the critical path first — speeding up non-critical paths has zero impact on total latency.
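A minimal sketch of per-branch spans with OpenTelemetry, assuming a tracer provider is already configured and treating the branch coroutines (call_db, call_rag, call_api) as illustrative: the longest child span under the fan_out span is the critical path.

```python
import asyncio

from opentelemetry import trace

tracer = trace.get_tracer("agent.fanout")

async def traced_branch(name: str, coro):
    # Each branch gets its own child span with its own duration
    with tracer.start_as_current_span(f"branch.{name}") as span:
        result = await coro
        span.set_attribute("branch.name", name)
        return result

async def fan_out(query: str):
    with tracer.start_as_current_span("fan_out"):
        # asyncio.gather copies the current context, so child spans parent correctly
        return await asyncio.gather(
            traced_branch("db", call_db(query)),
            traced_branch("rag", call_rag(query)),
            traced_branch("api", call_api(query)),
        )
```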
Failure modes unique to parallelism
| Failure mode | Cause | Detection | Mitigation |
|---|---|---|---|
| Thundering herd | All parallel workers retry simultaneously after a transient failure, causing a traffic spike that re-triggers the failure | Correlated 429 spike across workers | Full jitter exponential backoff — never fixed-delay retry in parallel workers |
| Head-of-line blocking | One slow branch holds up the fan-in aggregator, blocking the entire pipeline despite other branches completing | Fan-in wait time p99 much higher than p50 | Timeout individual branches; aggregate with partial results |
| Cascading failures | A failed tool call causes agent to retry with different parameters, amplifying load on an already-struggling service | Error rate rising across unrelated services | Circuit breaker per tool; fail fast, don't retry immediately |
| Lost updates | Two concurrent agents write to the same state key without synchronisation, second write silently overwrites first | Missing data in agent outputs; inconsistent audit logs | Optimistic locking (version field + CAS); append-only logs |
| Partial fan-out results | Aggregator receives M of N branch results and treats them as complete, producing incorrect synthesis | Output quality degradation; missing data signals | Explicit result count assertion before reduction; mark incomplete syntheses |
| Rate limit cascade | High parallelism exhausts LLM provider quota, all workers start failing, tasks pile up in queue | Queue depth rising; 429 error rate spike | Shared token bucket limiter; provisioned throughput; model routing to spare capacity |
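The first row of the table prescribes full-jitter exponential backoff. A minimal sketch of that retry policy for parallel workers; TransientError is a hypothetical exception type standing in for whatever transient failures your tools raise.

```python
import asyncio
import random

async def retry_with_full_jitter(fn, max_attempts: int = 5, base: float = 0.5, cap: float = 30.0):
    """Retry an async callable with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await fn()
        except TransientError:                 # hypothetical transient-error type
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep uniform(0, exponential cap) so parallel workers de-correlate
            await asyncio.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```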
The cost model of parallelisation
Parallelisation reduces latency but increases cost. Every parallel branch consumes tokens, compute, and API quota. The cost-latency trade-off must be explicit and measured, not assumed.
Parallelism is a latency loan. You pay in compute cost today to avoid paying in user experience tomorrow. The interest rate is how much each parallel branch costs relative to the sequential path it replaces.
Cost-aware systems design