Component detail
Explanation, examples & design rationale
Fan-out at independence points
Parallelisation is only possible where there is no dependency between tasks. The core skill is identifying which tasks have no edges between them in the work graph — those are the fan-out points. Everything else is either a sequential step or a merge.
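A minimal sketch of this in Python with asyncio, assuming hypothetical dense_search and bm25_search coroutines: the two searches have no edge between them in the work graph, so they fan out, and the step that combines their results is the merge.

```python
import asyncio


async def dense_search(query: str) -> list[str]:
    # Stand-in for a ~40ms vector search (hypothetical).
    await asyncio.sleep(0.04)
    return ["dense-hit-1", "dense-hit-2"]


async def bm25_search(query: str) -> list[str]:
    # Stand-in for a ~10ms keyword search (hypothetical).
    await asyncio.sleep(0.01)
    return ["bm25-hit-1"]


async def retrieve(query: str) -> list[str]:
    # Fan-out: independent tasks launched together.
    dense, sparse = await asyncio.gather(dense_search(query), bm25_search(query))
    # Merge: this step depends on both results, so it stays sequential.
    return dense + sparse


results = asyncio.run(retrieve("what is fan-out?"))
```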
The two latency equations
Sequential: total = sum of all steps. Parallel: total = slowest single step. If dense search takes 40ms and BM25 takes 10ms and you run them sequentially, you wait 50ms; in parallel you wait 40ms. Across a multi-stage pipeline serving many queries, that difference compounds.
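The same figures as a tiny worked example (the 40ms and 10ms values are the ones quoted above, not measurements):

```python
# The two latency equations applied to the numbers above.
step_latencies_ms = {"dense_search": 40, "bm25_search": 10}

sequential_ms = sum(step_latencies_ms.values())  # total = sum of all steps -> 50
parallel_ms = max(step_latencies_ms.values())    # total = slowest single step -> 40

print(sequential_ms, parallel_ms)  # 50 40
```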
Ingestion vs retrieval grain
Ingestion parallelism is about throughput — processing thousands of documents as fast as possible, with hours or days available. Retrieval parallelism is about latency — shaving milliseconds from a user-facing query that must complete in under 200ms. The techniques overlap but the pressure is opposite.
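A sketch of the ingestion grain under those assumptions, where chunk_document() is a hypothetical stand-in for the CPU-bound parsing and chunking step; the retrieval grain is the per-query asyncio.gather fan-out shown earlier.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk_document(doc: str) -> list[str]:
    # CPU-bound work: split one document into 500-character chunks.
    return [doc[i:i + 500] for i in range(0, len(doc), 500)]


def ingest(documents: list[str]) -> list[list[str]]:
    # Throughput grain: spread thousands of documents across processes.
    # Per-document latency barely matters; documents-per-hour is the metric.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(chunk_document, documents, chunksize=64))


if __name__ == "__main__":
    corpus = ["lorem ipsum " * 200] * 1000
    chunked = ingest(corpus)
    print(len(chunked), "documents chunked")
```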
The GPU is always the bottleneck
No amount of CPU-level parallelism overcomes a GPU bottleneck. All workers queue at the embedding model. The fix is always one of three things: a bigger batch size, more GPUs, or offloading to a hosted API that handles GPU parallelism on the provider's side.
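Of those three, batch size is the one you control in application code. A hedged sketch, where embed_batch() is a hypothetical placeholder for whatever model or hosted API call actually runs on the GPU:

```python
from collections.abc import Iterator


def batched(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    # Group texts so the GPU does one large forward pass per batch
    # instead of many small ones.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


def embed_batch(batch: list[str]) -> list[list[float]]:
    # Hypothetical placeholder: one GPU forward pass (or one hosted-API call).
    return [[0.0] * 768 for _ in batch]


def embed_all(texts: list[str], batch_size: int = 256) -> list[list[float]]:
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```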
asyncio vs multiprocessing
asyncio is for I/O-bound work: waiting for HTTP responses, database writes, S3 uploads — anything where the process is idle waiting for network. Multiprocessing / Ray is for CPU-bound work: PDF parsing, text processing, chunking logic. Using asyncio for CPU-bound tasks does not help — the GIL still blocks.
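A minimal sketch of routing each kind of work to the right tool, with a hypothetical parse_pdf() for the CPU-bound step and upload_chunk() for the I/O-bound one:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


async def upload_chunk(chunk: str) -> None:
    # I/O-bound: the process is idle while waiting, so asyncio can
    # overlap thousands of these on a single thread.
    await asyncio.sleep(0.05)  # stand-in for a network round-trip


def parse_pdf(path: str) -> str:
    # CPU-bound: the GIL would serialise this under asyncio or threads,
    # so it runs in a separate process.
    return f"text extracted from {path}"


async def main(paths: list[str]) -> None:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # CPU-bound parsing in worker processes...
        texts = await asyncio.gather(
            *(loop.run_in_executor(pool, parse_pdf, p) for p in paths)
        )
    # ...then I/O-bound uploads overlapped with asyncio.
    await asyncio.gather(*(upload_chunk(t) for t in texts))


if __name__ == "__main__":
    asyncio.run(main(["a.pdf", "b.pdf", "c.pdf"]))
```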