Pathways: Single Controller Asynchronous Distributed Infrastructure for Machine Learning

 

The paper https://arxiv.org/abs/2203.12533 introduces Pathways [1], a large-scale orchestration layer and execution model designed to train massive AI models across thousands of accelerators. Pathways is designed to support multimodal, sparse architectures capable of solving thousands or millions of tasks, and it is the distributed runtime that powers Google’s internal large-scale training and inference infrastructure. Using a sharded dataflow design, it coordinates data transfer across thousands of accelerators, achieving ~100% accelerator utilization when running SPMD (single program, multiple data) computations over 2048 TPUs. 






Problem & Motivation 


Before Pathways, most large-scale ML systems followed one of two paradigms, both of which had limitations: 

  • Dense and inefficient models: In a dense model, every parameter is applied to every input. Dense models are often preferred because they are easier to train and don’t require complex routing logic, but they are inefficient: every single neuron (parameter) fires for every single input, consuming large amounts of energy. In this paradigm, frameworks typically run a copy of the user code on every host. If Host A needs to do Task 1 while Host B does Task 2, one way to achieve this is to write conditional logic, which introduces complexity. This makes the setup rigid, and it struggles with pipelining and multi-tasking (different nodes doing different things). 
 
  • Single-controller Systems: Older systems used a central controller to dispatch work. Historically, this model was considered too slow for high-performance ML compared to multi-controller systems (like JAX). The primary reasons why a traditional single controller can become a bottleneck are: 
 
    • In a single-controller setup, the master often sits on a separate machine connected to the workers via the Datacenter Network (DCN). Sending a command over the DCN can take 10 to 100 times longer than the communication inside a multi-controller system, where commands are issued locally over the PCIe bus. 
 
    • The tasks are serialized through a queue. Even if each task takes only a microsecond, the time it takes the controller to loop through thousands of workers adds up, and the controller’s own CPU can become overwhelmed just trying to track the status of thousands of parallel tasks.
 
    • The accelerators remain under-utilized because if even one task is slightly slower (a straggler) due to a minor hardware hiccup or network congestion, the entire cluster stops and waits for that one node, wasting massive amounts of expensive compute time. 
 
    • In older single-controller systems, the controller had to hold the entire computational graph in memory. As models grew to billions of parameters, the sheer size became too much for one machine to manage, leading to high memory overhead and slow compilation times.  
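The dispatch bottleneck above can be sketched with back-of-envelope arithmetic. The latency and step-time numbers below are illustrative assumptions, not figures from the paper:

```python
# Toy model of serial dispatch overhead in a naive single-controller
# setup: the controller issues one command per worker, one after
# another, before the step can run. All timings are made-up examples.

def serial_dispatch_overhead(num_workers, per_command_latency_s, step_time_s):
    """Fraction of each step lost to serial command dispatch."""
    dispatch_time = num_workers * per_command_latency_s
    return dispatch_time / (dispatch_time + step_time_s)

# Assume 2048 workers, 10 microseconds per command over the DCN,
# and a 10 ms accelerator step.
overhead = serial_dispatch_overhead(2048, 10e-6, 10e-3)
print(f"{overhead:.1%} of each step spent dispatching")  # ~67% wasted
```

Even with optimistic per-command latency, serial dispatch dominates the step time at scale, which is exactly what asynchronous dispatch is designed to avoid.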

Pathways is designed to support large-scale, multimodal, sparse architectures. 


Sparse Architectures (Mixture of Experts) 


In a sparse architecture, the model is divided into many "experts." For any given input, a "router" selects only the most relevant experts to handle the task. The benefit is that you can have a model with 1 trillion parameters while only using the compute of a 50-billion-parameter model. This is similar to how human brains work: different parts of the brain specialize in different tasks. 
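The routing idea can be sketched in a few lines. The expert functions and scoring rule below are invented for illustration (real routers are learned networks), but the shape is the same: score every expert, run only the top-k:

```python
# Toy mixture-of-experts router: for each input, only the top-k
# highest-scoring experts run; the rest stay idle. The experts and
# scores here are made up for illustration.
import heapq

EXPERTS = {
    "math":    lambda x: x * 2,
    "code":    lambda x: x + 100,
    "vision":  lambda x: -x,
    "general": lambda x: x,
}

def route(x, scores, k=1):
    """Run only the k highest-scoring experts on input x."""
    chosen = heapq.nlargest(k, scores, key=scores.get)
    return {name: EXPERTS[name](x) for name in chosen}

# Only 1 of 4 experts fires, so compute is roughly 1/4 of a dense pass.
print(route(10, {"math": 0.9, "code": 0.2, "vision": 0.1, "general": 0.3}))
# → {'math': 20}
```

Scaling the same pattern up, a trillion-parameter model activates only the handful of expert shards the router picks, which is why its per-token compute can match a far smaller dense model.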

 

Multimodal Integration 

Multimodal models can process, understand, and generate information across multiple types of data such as text, images, audio, video all at the same time. 

Fusion of Sparse and Multimodal 

Following is an example of how a sparse, multimodal model (like Gemini or ChatGPT) works: 

  • Users upload a photo of a broken car and send a prompt asking how to repair it. (Input)
 
  • The vision encoder turns the pixels into "tokens." (Multimodal)
 
  • The router sees the visual tokens (car parts) and the text (repair instructions). It sends the data to the "Mechanical Engineering" expert. (Sparse)
 
  • The model generates a step-by-step text guide by only using a fraction of its total power, making the response fast enough for a chat interface. 
     
The Pathways paper shows that one can keep the flexibility of a single controller without the slowness by using: 
 
  • Asynchronous Dispatch: The controller doesn't wait for a task to finish, it dispatches "futures" for the next 10 tasks immediately. 
 
  • Delegated Coordination: It uses a sub-system named Plaque to handle the handshakes between hosts, so the controller doesn't get bogged down in the details of data movement. Because the controller isn't involved in every data transfer, it scales well: the controller handles the strategy, while Plaque handles the logistics of the handshakes across the network. My guess is that each host has a Plaque proxy, and these proxies communicate with each other directly to ensure they are all ready for a data transfer; after the handshake, the data is transferred directly between the memories of the TPU hosts via high-speed interconnects (like ICI), bypassing the CPU and the controller entirely. If one host in a cluster of 5,000 fails, Plaque can report the specific handshake failure back to the controller (resource manager), which then re-routes the logic without having to restart the entire global system. 
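The "dispatch futures and don't wait" idea can be sketched with ordinary thread-pool futures. The function names here are illustrative, not the Pathways API; the point is that the controller enqueues a whole chain of steps up front and each step blocks only on its own input:

```python
# Minimal sketch of asynchronous dispatch: the "controller" submits a
# chain of dependent steps immediately as futures and never waits;
# each step waits on its *input* future, not on the controller.
from concurrent.futures import ThreadPoolExecutor

def run_step(step_id, upstream):
    """A worker step: wait for its input, then do a bit of work."""
    value = upstream.result() if upstream is not None else 0
    return value + step_id

with ThreadPoolExecutor(max_workers=4) as pool:
    future = None
    # The controller "fires and forgets" 10 chained steps up front.
    for step_id in range(1, 11):
        future = pool.submit(run_step, step_id, future)
    # Only the final consumer ever blocks on a result.
    print(future.result())  # 1 + 2 + ... + 10 = 55
```

Because the control plane only submits work and never waits in the inner loop, it can stay many steps ahead of the hardware, which is the property Pathways exploits at cluster scale.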

 

Island 

In the paper, an "Island" refers to a high-performance, tightly coupled cluster of accelerators (like TPUs or GPUs) connected by a specialized, extremely fast internal network. In a massive data center, it is difficult or impossible to connect thousands of chips at the same speed. Instead, they are organized into groups: 

  • The Island (The "Pod"): A physical group of accelerators (e.g., 512 or 2048 TPU cores) that share a dedicated, high-bandwidth, low-latency interconnect like Google’s ICI (Inter-Chip Interconnect) or NVIDIA’s NVLink. Inside an island, chips can talk to each other almost as fast as if they were on the same circuit board. 
 
  • The Datacenter Network (DCN): The regular "internet-like" network that connects different islands to each other. This network is significantly slower and has much higher latency than the internal connections within an island. 
 
Before Pathways, most ML frameworks lived on a single island. If you needed to train a model so big that it required 4096 chips, but each island only had 2048 chips, the software couldn't easily bridge the gap. 

Pathways changes this by: 

  • Virtualizing the Island: It allows a user to request a "virtual slice" of compute that can actually span across multiple physical islands. 
 
  • Hiding the details: It uses asynchronous dispatch and futures to send data between islands in the background. While one island is busy calculating, the system is already moving the data across the DCN (data center network) to the next island so it’s ready when needed. 
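The overlap trick in the second bullet can be sketched with a background transfer thread. The timings and helper names (`dcn_send`, `compute`) are illustrative assumptions, not the real system:

```python
# Sketch of hiding DCN latency: while an "island" computes on the
# current chunk, the next chunk is already in flight over the slow
# cross-island link. Sleeps stand in for transfer/compute time.
import time
from concurrent.futures import ThreadPoolExecutor

def dcn_send(chunk):
    time.sleep(0.01)           # slow cross-island transfer
    return chunk

def compute(chunk):
    time.sleep(0.01)           # accelerator work on the current chunk
    return chunk * 2

chunks = [1, 2, 3, 4]
results = []
with ThreadPoolExecutor(max_workers=1) as dcn:
    inflight = dcn.submit(dcn_send, chunks[0])     # start transfer early
    for nxt in chunks[1:] + [None]:
        ready = inflight.result()                  # usually already done
        if nxt is not None:
            inflight = dcn.submit(dcn_send, nxt)   # overlap next transfer
        results.append(compute(ready))
print(results)  # [2, 4, 6, 8]
```

After the first transfer, every later one overlaps with compute, so the slow network adds latency only once instead of once per chunk.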

 

Key Architectural Innovations 

 

Pathways introduces a "client-server" architecture where a user’s Python script acts as a client that sends "compiled functions" (XLA computations) to a long-running Pathways backend; the dedicated backend (the server) manages execution across thousands of accelerators. 


  • Sharded Dataflow Graph: Pathways represents the program as a graph where nodes are operators (sharded across devices) and edges are "futures" (placeholders for data). Pathways breaks this graph into "shards", and each shard is mapped to a set of TPUs. 
 
  • Asynchronous Dispatch: Unlike traditional systems that wait for a task to finish before enqueuing the next, Pathways "fires and forgets." It dispatches work to accelerators before the inputs are even ready, allowing the control plane to stay ahead of the hardware. 
 
  • Single-Controller Model: Pathways adopts a single-controller model but it enables MPMD (Multiple Programs, Multiple Data) where a central "resource manager" and per-island schedulers coordinate execution. 
    • Resource Manager: This global component manages all devices across the various accelerator islands. It tracks the health and availability of every single chip; when a user dispatches a Sharded Dataflow Graph, the Resource Manager looks at the requirements and "maps" the shards to the best available physical chips. If a chip fails during a run, the Resource Manager can "re-virtualize" the job by moving that specific shard to a healthy chip without restarting the entire 10,000-chip job. 
 
    • Per-Island Schedulers: Each island of accelerators has a centralized scheduler that manages the specific timing of computations for that island. 
 
  • Gang-Scheduling: It can efficiently "gang-schedule" computations, ensuring that thousands of accelerators start a parallel task at exactly the same time, which is critical for minimizing "straggler" effects. Stragglers are tasks that take significantly longer to complete than the average task of the same type, and they can cause a number of problems for data processing pipelines, including increased latency, reduced throughput, and increased costs. Traditionally, stragglers are dealt with by first identifying them through observability and then mitigating them. 
 
  • Multi-Pod Scaling: Pathways is designed to transparently span multiple TPU pods. Usually, a TPU pod is a physical limit for high-speed interconnects; Pathways allows a model to be sharded across multiple pods as if they were one giant machine. 
 
  • Distributed Coordination and Plaque: Plaque is the internal Google technology (closed-source) used for coordination. Plaque is a distributed, sharded dataflow system that handles the "handshakes" between hosts. It ensures that when one set of accelerators finishes a task, the next set is notified instantly without needing to talk back to the central controller. 
 
  • Hardware Constraints: Accelerators (TPUs/GPUs) are grouped into islands (Pods) with high-bandwidth, low-latency Inter-Core Interconnects (ICI). However, communication between islands happens over the Datacenter Network (DCN), which has much lower bandwidth and higher latency. Pathways is designed to abstract this, making a multi-island cluster look like a single machine. By using speculative execution and enqueuing operations onto the accelerator’s hardware queue in advance, asynchronous abstraction effectively masks dispatch latency for small operations. 
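The sharded-dataflow and asynchronous-dispatch bullets above fit together in one small sketch. The graph, node names, and operations are invented for illustration; the real system shards each node across devices:

```python
# Tiny dataflow sketch: the program is a graph whose edges are futures.
# The controller walks the graph once, submitting every node up front;
# each node blocks only on its own input futures, so dispatch stays
# ahead of execution. Node names and ops are illustrative.
from concurrent.futures import ThreadPoolExecutor

# node -> (operation over input values, list of upstream nodes)
GRAPH = {
    "load":    (lambda: 3,          []),
    "shard_a": (lambda x: x + 1,    ["load"]),
    "shard_b": (lambda x: x * 10,   ["load"]),
    "merge":   (lambda a, b: a + b, ["shard_a", "shard_b"]),
}

def run_node(op, deps):
    # Wait on input futures only, never on the controller.
    return op(*(f.result() for f in deps))

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {}
    for name, (op, parents) in GRAPH.items():  # dicts keep insertion order
        futures[name] = pool.submit(run_node, op, [futures[p] for p in parents])
    print(futures["merge"].result())  # (3 + 1) + (3 * 10) = 34
```

Notice the controller's loop finishes before most nodes have run: it only wires futures together, which is the "fire and forget" control plane the paper describes, here shrunk to four nodes.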

     

Performance & Key Takeaways 


The paper provides several benchmarks to prove that this flexible architecture doesn't sacrifice speed:
 
  • Performance Parity: On standard dense workloads, Pathways achieves ~100% accelerator utilization, matching the performance of specialized systems like JAX even at scales of 2048 TPUs.
 
  • Complex Parallelism: It demonstrates high efficiency for pipelined models (splitting a model across stages) and sparse models (where only parts of the hardware are active at a time), which are difficult for traditional systems to manage. 
 
  • Generalization: It moves away from one-off clusters for specific models toward a shared pool of thousands of accelerators that can run many different tasks simultaneously. 
 
  • Virtualization: It allows for islands of compute where resources can be dynamically reallocated without restarting the entire system, improving hardware utilization across a data center.  

 

Summary 

 

The paper gives an overview of the internal infrastructure Google uses for these workloads. Pathways is a system that allows Google to link thousands of AI chips together so they work as one giant, seamless unit. It cuts out the waiting time between chips, allowing them to pass information back and forth instantly, which makes training huge AI models much faster and more efficient. The paper is from 2022; since technology, especially AI, is moving at warp speed, it is hard to guess whether Pathways as described in the paper is still being used in its original form or whether it has evolved. 

Because Plaque handles the handshakes between hosts, users and programmers see the entire data center as a single, giant accelerator; they don't have to write code to manage individual handshakes. Users simply write the model logic, and Plaque handles the billions of transfers happening under the hood. Because of the use of closed-source technologies like Plaque, it’s difficult to mimic this infrastructure outside of Google. 

Pathways took Google from a world of "one model per task" to "one architecture for everything"; it was the bridge between the original Transformer and the modern Gemini era.  

 

References 

  1. Pathways: Asynchronous Distributed Dataflow for ML - https://arxiv.org/abs/2203.12533  
  2. Pathways: A next-generation AI architecture - https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
