Pathways: Single Controller Asynchronous Distributed Infrastructure for Machine Learning
The paper https://arxiv.org/abs/2203.12533 introduces Pathways [1], a large-scale orchestration layer and execution model designed to train massive AI models across thousands of accelerators. Pathways targets multimodal, sparse architectures capable of solving thousands or even millions of tasks, and it is the distributed runtime that powers Google's internal large-scale training and inference infrastructure. Using a sharded dataflow design, it coordinates data transfer across thousands of accelerators, achieving close to 100% accelerator utilization when running SPMD (single program, multiple data) computations on 2048 TPUs.
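To make the SPMD term concrete, here is a minimal toy sketch in plain Python: the same program runs on every shard of the data, and a collective operation (here, an all-reduce sum) combines the per-shard results. This is only an illustration of the programming model, not Pathways itself; the simulated "devices" and the helper names (`local_program`, `all_reduce`) are invented for this example.

```python
def local_program(shard):
    # Identical code runs on every simulated device: a local sum of squares.
    return sum(x * x for x in shard)

def all_reduce(partials):
    # Collective step: every device ends up with the same global sum.
    total = sum(partials)
    return [total] * len(partials)

# Shard a global array across 4 simulated devices (round-robin split).
data = list(range(8))
num_devices = 4
shards = [data[i::num_devices] for i in range(num_devices)]

partials = [local_program(s) for s in shards]  # run the single program on each shard
reduced = all_reduce(partials)                 # combine the multiple-data results
print(reduced[0])  # global sum of squares over 0..7: 140
```

In a real SPMD system the shards live on separate accelerators and the all-reduce is a cross-device network collective; the orchestration layer's job is to dispatch the single program and schedule these transfers efficiently.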