Agentic Engineering · Systems Design · Cloud Architecture

Parallelisation in Agentic Systems

How to design, build, and deploy agent workloads that execute concurrently across distributed, containerised infrastructure — and why it fundamentally changes how you think about LLM latency.

8 core patterns · Fan-out, map-reduce, pipelines & more
3 cloud platforms · AWS · Azure · GCP deployments
Production code · Python, async patterns, real examples
Depth level · Senior engineer / architect
01 — Foundations

What is parallelisation, precisely?

Parallelisation is the practice of decomposing a larger unit of work into smaller, independent sub-units that execute simultaneously rather than sequentially. The result is that total wall-clock time approaches the duration of the slowest single unit, rather than the sum of all units.

This sounds simple. The engineering consequences are profound.

Sequential execution is a tax on latency. Every step that could run concurrently but does not is borrowed time, and you pay interest on it with every request.

Systems design principle

In classical computing, parallelisation operates at two levels: data parallelism (same operation, multiple data chunks) and task parallelism (different operations running concurrently). Agentic systems inherit both and add a third: model parallelism — splitting inference itself across compute units.

The latency mathematics

Consider a task with five steps, each taking 2 seconds. Sequential execution costs 10 seconds. If three of those steps are independent, parallelising them brings the critical path to 6 seconds — a 40% reduction with zero algorithmic improvement.

In agent systems, individual steps (LLM calls, RAG retrievals, API tool calls) routinely take 200ms to 3 seconds each. When an agent needs to call six tools to answer a question, the difference between sequential and parallel execution is the difference between a 12-second response and a 3-second response — with the same quality output.

Sequential vs parallel execution — same 6 tool calls

SEQUENTIAL — total: 13 seconds
DB lookup (2s) → RAG retrieve (3s) → LLM call 1 (2s) → API tool (1s) → LLM call 2 (3s) → Format output (2s)

PARALLEL (fan-out steps 1–4, then sequential steps 5–6) — total: 8 seconds
Fan-out branch (wall-clock cost is the slowest step, 3s): DB lookup (2s) · RAG retrieve (3s, critical path) · LLM call 1 (2s) · API tool (1s)
Serial tail: LLM call 2 (3s) → Format output (2s)
Total: 3s + 3s + 2s = 8 seconds — roughly 40% faster, identical output quality
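
A minimal asyncio sketch of the same six calls, with sleeps standing in for real I/O so the arithmetic above is concrete; the helper names and durations mirror the figure and are illustrative only.

Python · asyncio
import asyncio, time

# Simulated tool calls: sleeps stand in for network I/O (durations from the figure)
async def db_lookup():      await asyncio.sleep(2)
async def rag_retrieve():   await asyncio.sleep(3)
async def llm_call_1():     await asyncio.sleep(2)
async def api_tool():       await asyncio.sleep(1)
async def llm_call_2():     await asyncio.sleep(3)
async def format_output():  await asyncio.sleep(2)

async def parallel_version():
    start = time.perf_counter()
    # Fan-out the four independent steps: wall-clock cost is the slowest branch (3s)
    await asyncio.gather(db_lookup(), rag_retrieve(), llm_call_1(), api_tool())
    # Serial tail: these steps depend on the fan-out results
    await llm_call_2()
    await format_output()
    print(f"parallel wall clock: {time.perf_counter() - start:.1f}s")  # ~8s vs ~13s sequential

asyncio.run(parallel_version())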
02 — Foundations

Why agents demand parallelisation

Traditional software systems make deterministic, fast, local function calls. An LLM call is none of these things: it is probabilistic, slow (200ms–10s), and remote. This changes the economics of sequential execution completely.

An agentic system by definition orchestrates multiple calls: planning calls, tool execution calls, validation calls, synthesis calls. A ReAct-pattern agent making six sequential LLM calls at 1.5 seconds each produces a 9-second response. That is catastrophic for user experience and uneconomical at scale.

The three forcing functions

Latency pressure

LLM inference is fundamentally slow. The only way to achieve sub-second agentic responses is to minimise the serial chain of LLM calls on the critical path.

Throughput demand

Production agentic systems handle thousands of concurrent users. Sequential per-request processing creates a throughput ceiling that cannot be broken without parallelism.

Task decomposition

Most complex agent tasks are naturally decomposable — summarise these 50 documents, check these 10 data sources, evaluate these 5 answers. Decomposition without parallelism misses the entire point.

Model diversity

Agents route to different models (fast/cheap for classification, capable for reasoning). Running these concurrently against the same input extracts the best of each.

Speculative gains

Launching multiple possible next steps simultaneously and using the first successful result — discarding the rest — trades compute cost for latency reduction.

Tool independence

Most tool calls within an agent step are data-independent. Calling a weather API, a database, and a search index simultaneously versus sequentially is pure parallelism gain.

The critical insight: In agentic systems, parallelisation is not an optimisation you apply later. It must be the foundational design principle from the first line of orchestration code. Retrofitting sequential agent pipelines for parallelism is architecturally expensive and usually results in partial, leaky implementations.
03 — Foundations

The bottlenecks parallelisation solves

Before designing parallelisation, you must correctly identify what is actually slow. In agent systems, the bottlenecks are predictable and measurable.

Bottleneck | Typical cost | Parallelisable? | Strategy
LLM inference | 500ms–10s per call | YES | Fan-out independent calls; model routing
RAG retrieval | 100–500ms per query | YES | Parallel queries across vector stores
Tool API calls | 50ms–2s per call | YES | asyncio.gather / Promise.all concurrent execution
Document processing | Seconds–minutes per doc | YES | Map-reduce across worker pool
Embedding generation | 50–200ms per batch | YES | Batch API + parallel batch submission
Agent planning (serial) | 1–3s per plan step | LIMITED | Speculative execution; plan caching
Final synthesis | 1–5s | NO | Reduce critical path feeding into it
State persistence | 1–10ms (Redis) | PIPELINE | Write-behind, async persistence
04 — Patterns

Fan-out / fan-in

Fan-out / fan-in is the most fundamental pattern in agentic parallelisation. A single input is distributed ("fanned out") to multiple concurrent workers, each processes independently, and results are collected ("fanned in") by an aggregator.

In agent systems, fan-out typically happens at the tool execution layer: the planner identifies N independent actions, fires all N simultaneously, and the orchestrator waits for all (or a quorum) to complete before proceeding.

Tool call fan-out

When an agent needs to call multiple tools to gather information for a single response, all independent calls are fired simultaneously. The agent awaits the last one to complete.

When to use: When tool results are independent — one result does not depend on another's output. Accounts for roughly 70% of real-world agent fan-out usage.

Aggregation: Results are merged into a unified context that the final LLM call synthesises. Result ordering is deterministic via keyed futures, not arrival order.

Failure handling: Individual tool failures should not abort the fan-out. Partial results with explicit nulls outperform a full retry in most cases.

Flow: Orchestrator → DB lookup / RAG query / API call (in parallel) → Aggregator (merge) → Synthesised context
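
The pattern above as a minimal sketch: results are keyed by tool name rather than arrival order, and an individual failure becomes an explicit null instead of aborting the fan-out. The tool functions are illustrative placeholders.

Python · asyncio
import asyncio
from typing import Any, Awaitable, Callable, Dict

async def fan_out_tools(query: str,
                        tools: Dict[str, Callable[[str], Awaitable[Any]]]) -> Dict[str, Any]:
    """Run independent tool calls concurrently; key results deterministically by tool name."""
    names = list(tools)
    results = await asyncio.gather(
        *[tools[name](query) for name in names],
        return_exceptions=True,   # a single tool failure must not abort the fan-out
    )
    # A failed branch becomes an explicit null the synthesis prompt can reason about
    return {name: (None if isinstance(res, Exception) else res)
            for name, res in zip(names, results)}

# Usage, assuming db_lookup, rag_query and api_call are async tool functions:
# context = await fan_out_tools(query, {"db": db_lookup, "rag": rag_query, "api": api_call})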

Document fan-out

A corpus of documents is distributed across a worker pool. Each worker independently processes its assigned documents — summarising, extracting, classifying. Results are collected and reduced.

When to use: Bulk document analysis, knowledge base ingestion, audit pipelines. Achieves near-linear throughput scaling with worker count, up to the LLM provider's rate limit.

Chunk size matters: Documents too large for a single context window must be chunked before fan-out. Chunk boundaries must be semantically coherent (sentence boundaries, paragraph breaks) not arbitrary byte offsets.

Rate limit awareness: Workers must share a token-bucket rate limiter. Naive parallelism exceeds the LLM provider's tokens-per-minute quota and causes cascading 429 errors.

Flow: 50-document corpus → batch splitter → parallel worker pool (a batch of documents per worker) → reduce → summary

Model fan-out (ensemble)

The same prompt is sent simultaneously to multiple LLM models. Responses are collected and a judge/aggregator selects the best or synthesises a consensus answer.

When to use: High-stakes decisions, fact-checking, content moderation, tasks where model variance matters. Especially valuable for classification tasks where a 3-model majority vote reduces error rates significantly.

Trade-off: Costs 2-3x more in tokens. Justified when accuracy improvement materially changes downstream decisions.

Judge pattern: A lightweight, fast model (Gemini Flash, Haiku) acts as judge — it does not regenerate but selects/scores among the candidate responses. Keep the judge cheap.

Flow: Input prompt → Claude 3.5 Sonnet / GPT-4o Mini / Gemini Flash (in parallel) → Judge (Haiku / Flash) → Best answer selected
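
A sketch of the ensemble with a judge, assuming three async model-call helpers and a cheap judge_call helper; none of these names refer to a specific SDK.

Python · asyncio
import asyncio
from typing import Awaitable, Callable, List

async def model_fan_out(prompt: str,
                        models: List[Callable[[str], Awaitable[str]]],
                        judge_call: Callable[[str], Awaitable[str]]) -> str:
    """Send the same prompt to several models; the cheap judge selects, it does not regenerate."""
    candidates = await asyncio.gather(*[model(prompt) for model in models])
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = await judge_call(
        f"Question:\n{prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "Reply with only the number of the best answer."
    )
    return candidates[int(verdict.strip())]

# models = [call_sonnet, call_gpt4o_mini, call_gemini_flash]   # illustrative helpers
# best = await model_fan_out("Classify this support ticket ...", models, judge_call=call_haiku)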

Multi-source search fan-out

A query is decomposed into multiple sub-queries or sent simultaneously to multiple knowledge sources (vector store, BM25 index, SQL, external API, web search). Results are ranked and merged.

When to use: RAG over heterogeneous knowledge sources. Enterprise agents that must query internal databases, document stores, and external APIs in a single retrieval step.

Result fusion: Reciprocal Rank Fusion (RRF) is the standard algorithm for merging ranked lists from multiple retrievers without access to individual relevance scores.

Timeout strategy: Each search has an independent timeout. A slow external API should not block results from the fast vector store. Return partial results on timeout.

Flow: User query → Query decomposer → Vector store (pgvector) / BM25 index (OpenSearch) / SQL DB (Postgres) / Web API (external), queried in parallel → RRF fusion + re-rank → Top-K merged results
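
Reciprocal Rank Fusion itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over every retriever that returned it, with k ≈ 60 as the conventional constant.

Python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists from multiple retrievers without needing relevance scores."""
    scores: Dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # better rank (smaller number) scores more
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse per-source rankings (document IDs are illustrative)
# top_k = reciprocal_rank_fusion([vector_ids, bm25_ids, sql_ids, web_ids])[:10]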
05 — Patterns

Map-reduce for agents

Map-reduce adapts the classic big-data pattern for LLM workloads. A large task is mapped across independent workers (each processing a subset), and results are reduced into a final output. In agent systems, both the map and reduce steps involve LLM calls.

The pattern in detail

Split: A large input (document corpus, dataset, long context) is divided into chunks that fit within model context limits. Splitting strategy matters: semantic boundaries preserve coherence.
Map: Each chunk is processed independently by a worker agent. All map workers run in parallel. Each produces a structured intermediate result (summary, extraction, classification, score).
Shuffle / collect: Intermediate results are gathered. In distributed deployments, this involves message passing through a queue or object store. Results may be partially ordered or grouped before reduction.
Reduce: A reduce agent synthesises the intermediate results into a final output. The reduce step itself may be parallelised in a tree structure if intermediate count is large (e.g. 1000 chunks → 10 groups → 1 final synthesis).

Hierarchical reduction

When the number of map outputs exceeds the reduce model's context window, use hierarchical (tree) reduction: reduce groups of N outputs, then reduce those group summaries, and so on until a single output remains. This is the LLM analogue of a classic parallel reduction tree.

Python · asyncio hierarchical_map_reduce.py
# Hierarchical map-reduce for large document corpora
import asyncio
from typing import List
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
CHUNK_SIZE = 8000   # tokens per chunk
REDUCE_GROUP = 10  # max summaries per reduce call

async def map_chunk(chunk: str, task: str) -> str:
    """Process a single chunk — the map step."""
    resp = await client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast + cheap for map
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nDocument chunk:\n{chunk}\n\nProvide a structured intermediate result."
        }]
    )
    return resp.content[0].text

async def reduce_batch(intermediates: List[str], task: str, final: bool = False) -> str:
    """Reduce a batch of intermediate results."""
    model = "claude-sonnet-4-6" if final else "claude-haiku-4-5-20251001"
    combined = "\n\n---\n\n".join(intermediates)
    resp = await client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nIntermediate results to synthesise:\n{combined}"
        }]
    )
    return resp.content[0].text

async def hierarchical_map_reduce(documents: List[str], task: str) -> str:
    # Phase 1: MAP — all chunks processed in parallel
    semaphore = asyncio.Semaphore(20)  # respect rate limits

    async def bounded_map(chunk):
        async with semaphore:
            return await map_chunk(chunk, task)

    intermediates = await asyncio.gather(
        *[bounded_map(doc) for doc in documents]
    )

    # Phase 2: HIERARCHICAL REDUCE — tree reduction
    current_level = list(intermediates)
    while len(current_level) > 1:
        groups = [
            current_level[i:i + REDUCE_GROUP]
            for i in range(0, len(current_level), REDUCE_GROUP)
        ]
        is_final = len(groups) == 1
        current_level = await asyncio.gather(
            *[reduce_batch(g, task, final=is_final) for g in groups]
        )

    return current_level[0]
06 — Patterns

Pipeline parallelism

Pipeline parallelism overlaps sequential stages. Rather than waiting for stage A to finish all its work before stage B begins, each stage starts processing output as soon as stage A produces it. This is streaming applied to agent orchestration.

The classic example in agent systems is streaming token generation into downstream processing. As the LLM streams tokens for step N, a parser begins extracting structured data. By the time the LLM finishes, the extracted data is already ready for step N+1.

Pipeline stages in a RAG agent: While the embedding model processes the user query, the retrieval system pre-warms its connection. While retrieval runs, the prompt template is assembled. While the LLM streams its response, structured extraction begins. Each stage overlaps with the next — the total wall clock time approaches the longest single stage, not their sum.
Python · asyncio streaming pipeline_parallel.py
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def stream_and_extract(prompt: str) -> dict:
    """Pipeline: stream LLM response while concurrently extracting structure."""
    buffer = []
    extracted = {}

    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for chunk in stream.text_stream:
            buffer.append(chunk)
            current_text = "".join(buffer)

            # Pipeline stage: extract JSON as it streams, don't wait for completion
            if "action" in current_text and not extracted.get("action"):
                extracted["action"] = parse_partial_json(current_text, "action")
                # Fire tool call immediately — don't wait for stream to end
                asyncio.create_task(dispatch_tool(extracted["action"]))

    return extracted

async def pipeline_rag_agent(query: str) -> str:
    # Stage 1+2 OVERLAP: embed query while pre-warming retrieval connection
    embed_task   = asyncio.create_task(embed_query(query))
    warmup_task  = asyncio.create_task(warmup_retriever())
    embedding, _ = await asyncio.gather(embed_task, warmup_task)

    # Stage 2+3 OVERLAP: retrieve while assembling prompt template
    retrieve_task = asyncio.create_task(retrieve_contexts(embedding))
    template_task = asyncio.create_task(build_prompt_template(query))
    contexts, template = await asyncio.gather(retrieve_task, template_task)

    # Stage 3: LLM streams — extraction is pipelined mid-stream
    prompt = template.format(contexts=contexts)
    result = await stream_and_extract(prompt)
    return result
07 — Patterns

Speculative execution

Speculative execution launches work before knowing if it will be needed, discards unused results, and uses the winner of whichever branch resolves first. This trades compute cost for latency — justified when the value of a fast response exceeds the marginal cost of wasted computation.

Speculative decoding at the model level

At the LLM inference level, a small draft model (e.g. Haiku) generates candidate tokens speculatively. The large target model (e.g. Sonnet) verifies multiple tokens in parallel per forward pass. When the draft matches, the large model effectively processes multiple tokens per step. This is how providers like Anthropic and Google achieve significantly higher throughput without changing the target model.

Speculative branching at the orchestration level

At the agent orchestration level, the planner launches multiple next-step branches simultaneously before it knows which one is correct. The first branch to return a valid result is used; others are cancelled.

Python · asyncio speculative_execution.py
import asyncio
from dataclasses import dataclass

async def speculative_route(query: str) -> str:
    """Launch fast and slow paths simultaneously; return first valid result."""

    async def fast_path() -> str:
        # Try cache first — returns instantly or raises CacheMiss
        result = await semantic_cache_lookup(query)
        if not result:
            raise ValueError("cache miss")
        return result

    async def medium_path() -> str:
        # Try RAG + small model — ~800ms
        context = await retrieve_and_summarise(query)
        return await haiku_generate(query, context)

    async def full_path() -> str:
        # Full agentic pipeline with tool calls — 3-6s
        return await full_agent_pipeline(query)

    # Stagger starts: fast fires immediately, medium after 50ms, full after 200ms
    # This avoids burning tokens on slow path if fast path will win
    tasks = {
        asyncio.create_task(fast_path()): "fast",
        asyncio.create_task(_delayed(medium_path, 0.05)): "medium",
        asyncio.create_task(_delayed(full_path, 0.20)): "full",
    }

    pending = set(tasks.keys())
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if not task.exception():
                # Cancel all remaining tasks — we have our answer
                for p in pending: p.cancel()
                return task.result()

    raise RuntimeError("All speculative paths failed")

async def _delayed(coro_fn, delay: float):
    await asyncio.sleep(delay)
    return await coro_fn()
Cost discipline: Speculative execution wastes tokens on paths that are cancelled. Always stagger launch delays so cheap paths have a chance to win before expensive paths start. Monitor your cancel rate — if the full path wins more than 10% of the time, your fast paths are not worth the speculative overhead.
08 — Patterns

The actor model

The actor model structures agents as independent units ("actors") that communicate exclusively via message passing, hold private internal state, and can create child actors. No shared memory. No direct method calls. This maps elegantly to agentic systems where agents must be independently scalable, failure-isolated, and geographically distributed.

Frameworks like Microsoft AutoGen and agent implementations on Ray implement actor semantics for LLM-based agents. Each agent actor has a mailbox (message queue), a behaviour function (the LLM call + tool execution logic), and an address (durable identity in the distributed system).

Actor property | Agent mapping | Cloud primitive
Mailbox | Task queue for incoming agent instructions | SQS FIFO / Service Bus / Pub/Sub
Private state | Agent memory (context, tool history) | Redis key per agent ID / Cosmos item
Address | Durable agent session ID | UUID + registration in Consul / etcd
Child actors | Sub-agent spawning | New task/pod creation + queue registration
Supervision | Orchestrator restarts failed agents | Kubernetes restartPolicy / Step Functions retry
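
A minimal sketch of these semantics using Ray; the AgentActor class and its behaviour function are illustrative placeholders, not a specific agent framework's API.

Python · Ray
import ray

ray.init()

@ray.remote
class AgentActor:
    """One actor per agent session: private state plus a message-passing interface."""
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.history = []           # private state; no other actor can read or write it

    def handle(self, message: str) -> str:
        # Behaviour function: in a real agent this wraps the LLM call and tool execution
        self.history.append(message)
        return f"{self.session_id} handled: {message}"

# Spawn independent, failure-isolated actors and message them concurrently
actors = [AgentActor.remote(f"agent-{i}") for i in range(3)]
futures = [actor.handle.remote("summarise the incident report") for actor in actors]
print(ray.get(futures))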
09 — Engineering

Async-first design

The single most impactful decision in agentic system design is committing to async-first architecture. Every LLM call, tool invocation, database read, and network request should be non-blocking by default. Synchronous calls inside an async runtime are a hidden bottleneck that prevents all parallelism patterns from working correctly.

The event loop model

Python's asyncio, Node.js's event loop, and Go's goroutines all provide cooperative multi-tasking: while one coroutine awaits a network response, the event loop runs other ready coroutines. This gives you thousands of concurrent "in-flight" operations on a single thread with no threading overhead.

Python · asyncio patterns async_patterns.py
# THE FOUR CORE ASYNC PARALLELISATION PRIMITIVES

# 1. asyncio.gather — N independent coroutines, all results needed
results = await asyncio.gather(
    llm_call(prompt_a),
    retrieve_context(query),
    call_tool("database", params),
    return_exceptions=True  # don't abort on single failure
)

# 2. asyncio.wait FIRST_COMPLETED — speculative, race to first valid result
done, pending = await asyncio.wait(
    [task_a, task_b, task_c],
    return_when=asyncio.FIRST_COMPLETED
)
winner = done.pop().result()
for p in pending: p.cancel()

# 3. asyncio.Semaphore — bounded concurrency (respect rate limits!)
sem = asyncio.Semaphore(20)  # max 20 concurrent LLM calls
async def bounded_call(prompt):
    async with sem:
        return await llm_call(prompt)
results = await asyncio.gather(*[bounded_call(p) for p in prompts])

# 4. asyncio.Queue — producer/consumer pipeline between stages
queue: asyncio.Queue = asyncio.Queue(maxsize=100)

async def producer():
    async for item in document_stream():
        await queue.put(item)  # blocks if queue is full (backpressure)
    for _ in range(10):
        await queue.put(None)   # one sentinel per consumer (10 consumers below)

async def consumer():
    while True:
        item = await queue.get()
        if item is None: break
        await process(item)

# Run producer and N consumers concurrently
await asyncio.gather(
    producer(),
    *[consumer() for _ in range(10)]
)
Blocking calls kill your event loop. Never call a synchronous I/O function (requests.get, psycopg2 queries, time.sleep) inside an async function without wrapping it in loop.run_in_executor(). One blocking call stalls the entire event loop — all other coroutines stop until it returns.
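
Two standard escapes, sketched with requests.get as the stand-in blocking call: asyncio.to_thread (Python 3.9+) or an explicit run_in_executor.

Python · asyncio
import asyncio
import requests   # synchronous library; must never be called directly inside a coroutine

async def fetch_sync_api(url: str) -> str:
    # Option 1: asyncio.to_thread (Python 3.9+) pushes the blocking call onto a worker thread
    resp = await asyncio.to_thread(requests.get, url, timeout=10)
    return resp.text

async def fetch_sync_api_executor(url: str) -> str:
    # Option 2: run_in_executor, for older Python or a custom thread/process pool
    loop = asyncio.get_running_loop()
    resp = await loop.run_in_executor(None, lambda: requests.get(url, timeout=10))
    return resp.text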
10 — Engineering

State management under parallelism

Parallelism introduces the hardest class of bugs in software engineering: race conditions, lost updates, and inconsistent reads. In agent systems, these manifest as agents reading stale context, tool results overwriting each other, or duplicate task execution.

The golden rule

Agent state must be partitioned by agent ID and never shared across concurrent agents without explicit synchronisation. The simplest pattern: each agent owns its own Redis key namespace (agent:{session_id}:*). No agent reads or writes another agent's namespace without explicit coordination via a message.

State type | Storage | Access pattern | Concurrency control
Session context | Redis (TTL 30min) | Read-heavy, write on turn | Optimistic locking (WATCH/MULTI)
Tool results | Redis / object store | Write once, read many | Immutable keys (write once)
Task status | Redis / Postgres | Single writer per task | CAS (compare-and-swap)
Agent memory | Vector DB + Redis | Read-heavy retrieval | Eventual consistency acceptable
Audit log | Append-only Postgres / S3 | Write-only from agents | Append-only (no updates)
Global config | Redis / etcd | Read-heavy, rare writes | Read replicas + cache
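
The first row of the table (session context with optimistic locking) in redis-py terms, as a minimal sketch; the key layout follows the agent:{session_id}:* convention above and update_context is an illustrative pure function.

Python · redis-py asyncio
import redis.asyncio as redis
from redis.exceptions import WatchError

async def update_session_context(r: redis.Redis, session_id: str, update_context) -> str:
    """Optimistic locking: retry when another writer touches the key mid-transaction."""
    key = f"agent:{session_id}:context"
    async with r.pipeline(transaction=True) as pipe:
        while True:
            try:
                await pipe.watch(key)                 # MULTI aborts if the key changes
                current = await pipe.get(key)
                new_value = update_context(current)   # pure function of current state
                pipe.multi()
                pipe.set(key, new_value, ex=1800)     # 30 min TTL, as in the table
                await pipe.execute()
                return new_value
            except WatchError:
                continue                              # lost the race; re-read and retry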

Idempotency is not optional

In distributed parallel systems, messages are delivered at least once. Tool calls will be retried. Agent steps will be re-executed on failure. Every agent action must be designed as an idempotent operation: executing it twice must produce the same result as executing it once. Use idempotency keys on all write operations.
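
A sketch of the idempotency-key check, assuming Redis as the deduplication store; the key prefix and the tool_call parameter are illustrative.

Python · redis-py asyncio
import redis.asyncio as redis

async def execute_once(r: redis.Redis, idempotency_key: str, tool_call, *args):
    """Run a side-effecting agent action at most once per idempotency key."""
    # SET NX atomically claims the key; a retried or duplicate delivery gets nothing back
    claimed = await r.set(f"idem:{idempotency_key}", "in-progress", nx=True, ex=86400)
    if not claimed:
        return None               # the action already ran (or is running) elsewhere
    result = await tool_call(*args)
    await r.set(f"idem:{idempotency_key}", "done", ex=86400)
    return result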

11 — Engineering

LangGraph parallel node execution

LangGraph is the dominant orchestration framework for production agentic systems. It models agent workflows as typed state graphs where nodes are async functions and edges define execution order. Parallel branches are expressed natively by the graph topology.

Python · LangGraph parallel_langgraph.py
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated, List
import operator, asyncio

class AgentState(TypedDict):
    query: str
    db_result: str
    rag_result: str
    api_result: str
    # Annotated with operator.add means results are MERGED not overwritten
    tool_outputs: Annotated[List[str], operator.add]
    final_answer: str

# Define parallel nodes — each is an independent async function
async def query_database(state: AgentState) -> dict:
    result = await db_lookup(state["query"])
    return {"db_result": result, "tool_outputs": [result]}

async def retrieve_context(state: AgentState) -> dict:
    result = await vector_search(state["query"])
    return {"rag_result": result, "tool_outputs": [result]}

async def call_external_api(state: AgentState) -> dict:
    result = await api_fetch(state["query"])
    return {"api_result": result, "tool_outputs": [result]}

async def synthesise(state: AgentState) -> dict:
    # This node only runs after all parallel branches complete (fan-in)
    combined = "\n".join(state["tool_outputs"])
    answer = await llm_synthesise(state["query"], combined)
    return {"final_answer": answer}

# BUILD THE GRAPH — topology encodes parallelism
graph = StateGraph(AgentState)
graph.add_node("db",  query_database)
graph.add_node("rag", retrieve_context)
graph.add_node("api", call_external_api)
graph.add_node("synth", synthesise)

# Fan-out from START to three parallel branches
graph.add_edge(START, "db")
graph.add_edge(START, "rag")   # all edges out of START run concurrently on invoke
graph.add_edge(START, "api")

# Fan-in: all branches must complete before synth runs
graph.add_edge("db",  "synth")
graph.add_edge("rag", "synth")
graph.add_edge("api", "synth")
graph.add_edge("synth", END)

app = graph.compile()
result = await app.ainvoke({"query": "What is the customer's current balance?"})
12 — Infrastructure

Containerised architecture for parallel agents

Containers are the natural unit of deployment for parallel agent workloads. Each agent type runs in its own container image: identical runtime environment, predictable resource envelope, independently scalable, independently deployable. Kubernetes orchestrates them at scale.

Container topology for a parallel agent system

Topology: API gateway / Nginx ingress → orchestrator pods (LangGraph · FastAPI, 3 replicas) → task queue (Kafka / SQS FIFO / Service Bus) → worker pools, KEDA-autoscaled on queue depth
Worker pools: planner (claude-sonnet, scale 0–20) · tool executor (function calls, scale 0–50) · RAG retriever (vector queries, scale 0–30) · embed worker (batch embedding, scale 0–10) · validator (output guardrails, scale 0–10)
Shared data layer: Redis cluster · vector DB · Postgres / Aurora · object store / S3

Key container design decisions

One concern per container image: Planner agents, tool executor agents, RAG retriever agents, and validator agents each have their own image. This enables independent scaling: a spike in retrieval workload scales retriever pods without touching planner pods.
Stateless containers, stateful data layer: Containers carry no state. Session data, tool results, and agent context all live in external stores (Redis, Postgres). Any pod can be killed and replaced without data loss. This is the foundational requirement for horizontal autoscaling.
Resource requests calibrated to workload profile: LLM-calling containers are I/O-bound, not CPU-bound. Set CPU requests low (0.25–0.5 vCPU), memory requests moderate (512MB–2GB depending on context size). GPU-based inference containers are separate — they have very different resource profiles.
Event-driven autoscaling (KEDA): Scale worker pod replicas based on queue depth, not CPU. A queue of 1000 pending tool-call tasks with 5 pods should immediately scale to 50 pods. KEDA reads the queue length and drives Kubernetes HPA accordingly.
13 — Infrastructure

Distributed systems design

A parallel agent system at production scale is a distributed system. All of the classical distributed systems problems apply: network partitions, message ordering, consistency vs availability trade-offs, and clock skew. These are not theoretical concerns — they manifest in production agent bugs.

Exactly-once semantics for agent tasks

Agent tool calls must not execute twice. Sending an email, writing a database record, or charging a payment must be idempotent with exactly-once guarantees. The standard pattern: assign a unique idempotency key to every task before it enters the queue. Workers check the key against a deduplication store (Redis SET or database unique constraint) before executing.

Distributed rate limiting

Forty parallel agents each making LLM calls simultaneously will exceed your provider's tokens-per-minute quota within seconds. A centralised rate limiter is mandatory. The token bucket algorithm implemented in Redis using the INCR/EXPIRE or Redis Cell module is the standard approach.

Python · Redis rate limiter distributed_rate_limiter.py
import redis.asyncio as redis
import time, asyncio

class TokenBucketLimiter:
    """Distributed token bucket — shared across all agent pods via Redis."""
    def __init__(self, r: redis.Redis, key: str,
                 capacity: int, refill_rate: float):
        self.r = r
        self.key = key          # e.g. "rate:claude:tokens"
        self.capacity = capacity # max tokens (e.g. 100_000 TPM)
        self.refill_rate = refill_rate  # tokens per second

    async def acquire(self, tokens: int) -> bool:
        """Atomically acquire tokens. Returns True if granted, False if throttled."""
        now = time.time()
        # Lua script runs atomically in Redis — no race conditions
        script = """
        local key, capacity, rate, tokens, now = KEYS[1], ARGV[1], ARGV[2], ARGV[3], ARGV[4]
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local current = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now
        local elapsed = now - last_refill
        local refilled = math.min(capacity, current + elapsed * rate)
        if refilled >= tokens then
            redis.call('HMSET', key, 'tokens', refilled - tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 1
        end
        return 0
        """
        result = await self.r.eval(
            script, 1, self.key,
            self.capacity, self.refill_rate, tokens, now
        )
        return bool(result)

    async def acquire_or_wait(self, tokens: int, max_wait: float = 30.0):
        deadline = time.time() + max_wait
        while time.time() < deadline:
            if await self.acquire(tokens):
                return
            await asyncio.sleep(0.1)
        raise TimeoutError(f"Rate limit: could not acquire {tokens} tokens in {max_wait}s")
14 — Cloud: AWS

AWS parallelisation deployment

Parallel execution stack on AWS

AWS provides the most mature serverless primitives for agentic parallelisation. The combination of Kinesis enhanced fanout, Lambda provisioned concurrency, and Step Functions Express is the industry-leading pattern for high-throughput agent pipelines.

Step Functions Express + Map State · Native fan-out: runs up to 40 parallel branches simultaneously. Map state distributes array items across parallel Lambda invocations. 100k executions/sec throughput.
Lambda Provisioned Concurrency · Pre-initialised execution environments. Zero cold start for tool-calling agents. Scale to thousands of concurrent executions. SnapStart for JVM runtimes.
Kinesis Enhanced Fan-Out · Each consumer gets dedicated 2MB/s read throughput. Multiple agent types consume the same stream without competing. Order preserved per shard key.
SQS FIFO + DLQ · Exactly-once delivery for idempotent agent tasks. Dead letter queue isolates poison-pill tasks without blocking the pipeline. Message groups for per-session ordering.
ECS Fargate + Spot · Long-running agent containers. Spot capacity for 70% cost reduction. Target tracking autoscaling on SQS queue depth via CloudWatch metrics.
Bedrock Batch Inference · Submit thousands of LLM requests as a batch job. Processed asynchronously at 50% lower cost than on-demand. Ideal for overnight bulk agent tasks.

Step Functions Map state — fan-out pattern

AWS Step Functions ASL · state-machine.json
{
  "ProcessDocuments": {
    "Type": "Map",
    "ItemsPath": "$.documents",
    "MaxConcurrency": 40,       // 40 parallel Lambda invocations
    "ToleratedFailurePercentage": 10,  // allow 10% failures
    "Iterator": {
      "StartAt": "AnalyseChunk",
      "States": {
        "AnalyseChunk": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:ap-southeast-2:*:function:agent-map-worker",
          "Retry": [{
            "ErrorEquals": ["ThrottlingException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 5,
            "BackoffRate": 2.0,
            "JitterStrategy": "FULL"
          }]
        }
      }
    },
    "Next": "ReduceResults"
  }
}
15 — Cloud: Azure

Azure parallelisation deployment

Parallel execution stack on Azure

Azure's standout for agentic parallelisation is Durable Functions — it provides stateful fan-out/fan-in orchestration as a first-class programming model with human-in-the-loop checkpoints and event-driven scaling via KEDA.

Durable Functions Fan-out · Activity function fan-out runs 500+ parallel tasks in a single orchestration. Results automatically aggregated. Sub-orchestrations for nested parallelism. Built-in compensation on failure.
AKS + KEDA · Scale worker pods from 0 to N based on Event Hubs consumer lag. More precise than CPU-based HPA for I/O-bound agent workloads. Integrates with Azure Monitor for custom metrics.
Event Hubs Kafka Protocol · Partition-based parallelism: 32 partitions = 32 max parallel consumers. Exactly-once delivery with consumer group coordination. Ordering keys for per-session sequencing.
Azure OpenAI PTU · Provisioned Throughput Units eliminate token-per-minute throttling. Fixed capacity means parallel agents don't compete for quota. Essential for sustained high-concurrency inference.
Cosmos DB Bulk Executor · Parallel writes at millions of RU/s for agent result persistence. Change Feed drives downstream event triggers. Partition key by session ID for co-located agent state.
Azure Container Apps · Serverless containers with built-in KEDA scaling. Simpler than AKS for stateless agent worker pools. Scale to zero when idle. HTTP and queue-based scaling triggers.

Durable Functions fan-out/fan-in

Python · Durable Functions · orchestrator.py
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    documents = context.get_input()

    # Fan-out: fire all activity functions in parallel
    parallel_tasks = [
        context.call_activity("AnalyseDocument", doc)
        for doc in documents
    ]

    # Fan-in: wait for ALL to complete (or use context.task_any for first-wins)
    results = yield context.task_all(parallel_tasks)

    # Human-in-the-loop: pause until external approval event
    approval = yield context.wait_for_external_event("ApprovalReceived")
    if approval["approved"]:
        yield context.call_activity("PublishResults", results)

    return results

main = df.Orchestrator.create(orchestrator_function)
16 — Cloud: GCP

GCP parallelisation deployment

Parallel execution stack on GCP

Pub/Sub's effectively unbounded horizontal scale and Vertex AI Reasoning Engine's managed agent runtime make GCP the strongest platform for agents that must scale to unpredictable, very high concurrency with minimal operational overhead.

Cloud Pub/Sub + ordering keys · Massively scalable event backbone. Ordering keys guarantee per-session message ordering at global scale. Exactly-once delivery. Push subscriptions trigger Cloud Run immediately without polling.
Cloud Run (concurrency per container) · A single Cloud Run instance handles hundreds of concurrent requests. CPU is only allocated during active request processing. Scales to zero; no cold start with min-instances=1.
Vertex AI Reasoning Engine · Fully managed agent runtime: deploy a LangChain/LangGraph agent and Google handles scaling, sessions, and tool execution infrastructure. Removes the need to build an orchestration platform.
Cloud Dataflow (Apache Beam) · Streaming parallel agent pipelines over unbounded data. Parallel DoFn execution across workers. Auto-scaling based on data volume. Integrates with Pub/Sub for exactly-once processing.
GKE Autopilot + HPA · No node management. Custom metrics HPA scales on Pub/Sub backlog. Workload Identity for secure Vertex AI access. Node auto-provisioning for GPU inference pods.
Vertex AI Batch Predictions · Submit thousands of prompts as a BigQuery table. Gemini processes in parallel across Google's infrastructure. Results written back to BigQuery. 40% cheaper than online predictions.

Dataflow parallel agent pipeline

Python · Apache Beam / Dataflow · agent_pipeline.py
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class AgentAnalysisDoFn(beam.DoFn):
    """Parallel per-element agent execution — runs on each Dataflow worker."""
    def setup(self):
        # Called once per worker — initialise client here, not per element
        from anthropic import Anthropic
        self.client = Anthropic()

    def process(self, element):
        response = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[{"role": "user", "content": element["text"]}]
        )
        yield {
            "id": element["id"],
            "analysis": response.content[0].text
        }

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=australia-southeast1",
    "--streaming",             # Pub/Sub is an unbounded source; requires streaming mode
    "--max_num_workers=100",   # up to 100 parallel workers
])

with beam.Pipeline(options=options) as p:
    (p
     | "Read"    >> beam.io.ReadFromPubSub(topic="projects/*/topics/docs")
     | "Parse"   >> beam.Map(json.loads)              # Pub/Sub delivers bytes; decode to dicts
     | "Analyse" >> beam.ParDo(AgentAnalysisDoFn())   # massively parallel across workers
     | "Write"   >> beam.io.WriteToBigQuery("project.dataset.results")
    )
17 — Operations

Observability for parallel agent systems

Debugging a sequential system is hard. Debugging a parallel agent system without proper observability is nearly impossible — race conditions, partial failures, and unexpected orderings are invisible without distributed tracing.

The three mandatory signals

Distributed traces

Every agent task carries a trace context (trace-id, span-id) propagated through all message headers and LLM API calls. A single user request produces one trace that spans all parallel branches. OpenTelemetry is the standard.

Fan-out metrics

Track branch count per fan-out, p50/p99/max branch duration, and fan-in wait time (how long the aggregator waits for the slowest branch). The slowest branch is always your bottleneck — identify it.

Queue depth + lag

Consumer lag on the task queue is the single most important operational metric. Lag growing means your agent workers cannot keep up. Lag flat means you are stable. Lag shrinking means you are over-provisioned.

The critical path problem

In parallel execution, the slowest branch determines total latency. Identifying which branch is the critical path requires per-branch timing in your traces. Once identified, optimise the critical path first — speeding up non-critical paths has zero impact on total latency.
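
A minimal OpenTelemetry sketch of per-branch spans around a fan-out, so the critical path is visible in the trace; the tracer name and the branch coroutines are illustrative.

Python · OpenTelemetry
import asyncio
from opentelemetry import trace

tracer = trace.get_tracer("agent.orchestrator")   # illustrative tracer name

async def traced_branch(name: str, coro):
    """Wrap one fan-out branch in its own span so per-branch duration is recorded."""
    with tracer.start_as_current_span(f"fanout.{name}") as span:
        span.set_attribute("agent.branch", name)
        return await coro

async def traced_fan_out(query: str):
    with tracer.start_as_current_span("fanout"):
        # The slowest child span is the critical path; optimise that branch first
        return await asyncio.gather(
            traced_branch("db", db_lookup(query)),
            traced_branch("rag", vector_search(query)),
            traced_branch("api", api_fetch(query)),
        )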

Langfuse is the de-facto tracing tool for LLM agent systems. It provides native support for LangGraph traces, token usage per span, cost attribution, and parallel branch visualisation. For cloud-native deployments, combine Langfuse for LLM-specific traces with OpenTelemetry for infrastructure traces, unified in Grafana.
18 — Operations

Failure modes unique to parallelism

Failure mode | Cause | Detection | Mitigation
Thundering herd | All parallel workers retry simultaneously after a transient failure, causing a traffic spike that re-triggers the failure | Correlated 429 spike across workers | Full jitter exponential backoff — never fixed-delay retry in parallel workers
Head-of-line blocking | One slow branch holds up the fan-in aggregator, blocking the entire pipeline despite other branches completing | Fan-in wait time p99 much higher than p50 | Timeout individual branches; aggregate with partial results
Cascading failures | A failed tool call causes the agent to retry with different parameters, amplifying load on an already-struggling service | Error rate rising across unrelated services | Circuit breaker per tool; fail fast, don't retry immediately
Lost updates | Two concurrent agents write to the same state key without synchronisation; the second write silently overwrites the first | Missing data in agent outputs; inconsistent audit logs | Optimistic locking (version field + CAS); append-only logs
Partial fan-out results | Aggregator receives M of N branch results and treats them as complete, producing incorrect synthesis | Output quality degradation; missing data signals | Explicit result count assertion before reduction; mark incomplete syntheses
Rate limit cascade | High parallelism exhausts LLM provider quota, all workers start failing, tasks pile up in queue | Queue depth rising; 429 error rate spike | Shared token bucket limiter; provisioned throughput; model routing to spare capacity
19 — Operations

The cost model of parallelisation

Parallelisation reduces latency but increases cost. Every parallel branch consumes tokens, compute, and API quota. The cost-latency trade-off must be explicit and measured, not assumed.

Parallelism is a latency loan. You pay in compute cost today to avoid paying in user experience tomorrow. The interest rate is how much each parallel branch costs relative to the sequential path it replaces.

Cost-aware systems design

Optimising the cost-latency trade-off

Model routing by task complexity · Map steps use fast, cheap models (Haiku at $0.001/1k tokens). Reduce/synthesis uses capable models (Sonnet at $0.015/1k tokens). Fan-out multiplies the cheap model cost — keep map models inexpensive.
Semantic caching · Before firing a parallel branch, check if the result is already cached by semantic similarity. Cache hit rate of 30-40% on tool calls is typical in production, directly reducing both latency and cost.
Lazy parallelism · Not every request needs maximum parallelism. Classify incoming requests and apply proportional resources: simple queries get sequential execution, complex research tasks get full fan-out. Avoids over-spend on simple traffic.
Batch for async workloads · Tasks with flexible latency (analytics, overnight processing, bulk ingestion) should use batch inference APIs rather than real-time parallel calls. Batch APIs cost 40-50% less for equivalent throughput.
Cancel promptly · In speculative execution, cancel losing branches immediately. A branch that runs for 500ms after being superseded costs tokens for zero user value. Use asyncio task cancellation or HTTP request cancellation aggressively.
Rule of thumb: The break-even point for parallelisation is when the latency reduction (in seconds) multiplied by your revenue-per-second value exceeds the additional token cost. For customer-facing agents with measurable conversion or NPS impact, this break-even is usually reached at 3+ parallel branches.
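
The rule of thumb as a quick calculation; the numbers below are illustrative assumptions, not benchmarks.

Python
def parallelisation_break_even(latency_saved_s: float, value_per_second: float,
                               extra_tokens: int, price_per_1k_tokens: float) -> float:
    """Positive return value means the latency win outweighs the extra token spend."""
    extra_cost = (extra_tokens / 1000) * price_per_1k_tokens
    return latency_saved_s * value_per_second - extra_cost

# Illustrative: saving 5s of latency valued at $0.002/s, paid for with
# 6,000 extra tokens on a $0.001-per-1k-token map model
surplus = parallelisation_break_even(5.0, 0.002, 6_000, 0.001)
print(f"net value per request: ${surplus:.4f}")   # positive means parallelise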