Posts

Architectural Analysis of the 2021 Roblox Infrastructure Outage

The Roblox service outage of October 2021, which lasted nearly 73 hours and affected over 50 million daily active users, is a lesson in how much architectural decisions matter in distributed systems engineering. While the immediate precipitating event was identified as a performance degradation within the open source BoltDB storage engine, triggered by high contention from a newly enabled streaming feature in HashiCorp's Consul, the event's magnitude cannot be attributed to a single software defect. Rather, the prolonged duration and global scope of the failure resulted from architectural decisions that favored tight coupling over isolation, and global consistency over localized availability. At scale, reliability is no longer a function of component quality alone but of topological design. Beyond a mere retrospective, I propose a few solutions for "Hyper Scale Reliability" tailored to the specific failures in this outage. While these are not the o...

Pathways: Single Controller Asynchronous Distributed Infrastructure for Machine Learning

The paper (https://arxiv.org/abs/2203.12533) introduces Pathways [1], a new large-scale orchestration layer and execution model designed to train massive AI models across thousands of accelerators. Pathways is designed to support multimodal, sparse architectures capable of solving thousands or millions of tasks. It is the distributed runtime that powers Google’s internal large-scale training and inference infrastructure. It coordinates data transfer across thousands of accelerators, achieving ~100% accelerator utilization when running SPMD (single program, multiple data) computations over 2048 TPUs using a sharded dataflow design.

Linearizability, Serializability, External Consistency, Strict Serializability

This is a reference post explaining the terms linearizability, serializability, external consistency, and strict serializability, since I have used them in several places in previous posts.

CAP Theorem Proof, Narrow Definitions and Modern Context

This post focuses on the proof of the CAP theorem, its narrow definitions, and its modern context, i.e. the relevance of the CAP theorem to system design and architecture today.

Paper Review - Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

A few years ago I came across a Microsoft Research paper on observability in distributed systems titled "Gray Failure: The Achilles’ Heel of Cloud-Scale Systems"; the link is in the reference section. The core concept in the paper is differential observability. While standard observability focuses on gathering data to reflect a system's state, the differential observability discussed in the paper highlights the failure of that data to accurately represent the user experience. Observability in distributed systems is the measure of how well you can understand the internal state of your system from the external data (signals) it produces. Traditionally, observability is built on three foundational types of data: logs, metrics, and traces (the end-to-end journey of a single request as it moves through a distributed system).

Paper Review - Paxos vs Raft: Consensus on distributed consensus

Reaching consensus among nodes or processes in a distributed system is a complex problem because the network is unreliable and the participants may fail. Two popular algorithms are Paxos (or its variant Multi-Paxos) and Raft. Paxos was proposed by the computer scientist Leslie Lamport (a Turing Award winner) in a seminal paper that laid the theoretical and practical groundwork for subsequent distributed consensus algorithms. Consensus protocols have wide applications, including payment systems, storage systems (data replication, e.g. Google Spanner or CockroachDB), distributed locking (shared resources in distributed systems), and blockchain technology. Heidi Howard and Richard Mortier from the University of Cambridge in the UK wrote a paper comparing the two algorithms, “Paxos vs Raft: Have we reached consensus on distributed consensus?” ( https://lnkd.in/g7yVWjAq ). It is an interesting read, I tried to summarize the paper in the foll...

Review - Environmental impact of delivering AI
