Lessons from Distributed Systems and Team Dynamics

Distributed computing is one of the most fascinating areas of practical computational theory. It underpins virtually every major system across verticals: finance, social media, AI, government, healthcare, cybersecurity. The internet is a distributed system. If you are building a multi-node system, you are building a distributed system. The theory is present whether or not you realize it.

About 15 years ago, during the rise of Big Data (a movement characterized by popular questioning of scale-up, single-node ACID models for consistency), I was immersing myself in distributed systems design and theory. It was incredible to see emergent systems that supported creative models for practical state management in distributed systems: Netezza’s MPP, Riak’s key-value store, Cassandra and HBase’s wide-column stores, MongoDB’s document storage, and Lucene-based engines like Elasticsearch. There was (and still is) an incredible array of options for scaling massively. (And this doesn’t even touch on event bus and synchronization technologies.)

Around the same time, there was a rising wave of enthusiasm for the leaderless software team. Software teams eschewed the archetypal politicking, incompetent middle manager (popularized by the comic strip Dilbert) and saw a better path. Teams could remove the manager entirely and focus on building really great systems together. Bureaucratic meetings would disappear and team productivity would rise. I read a number of articles that heralded this new step.

Against this backdrop, I was studying consensus algorithms: how systems coordinate on a decision. I gave a talk to my engineering group on the Paxos algorithm. I remember, even at the time, finding it difficult to describe the proposer, acceptor, learner flow for building consensus. Paxos solves the distributed consensus problem of consistent value agreement in a theoretically leaderless environment. It achieves this by supporting multiple proposers across many rounds. A proposer can propose a value under a numbered proposal; acceptors promise not to accept lower-numbered proposals. Once a majority accepts the proposal, the value is chosen. Later proposals repeat the process, but a chosen value never changes.
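The proposer and acceptor flow can be sketched in a few lines. This is a minimal, single-process illustration of one Paxos decision, not a production implementation: the names `Acceptor` and `propose` are made up for this example, there is no networking or failure handling, and the learner role is collapsed into the return value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None  # (number, value) last accepted

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n: int, value: str) -> Optional[str]:
    """Run one round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1

    replies = [a.prepare(n) for a in acceptors]
    granted = [prev for ok, prev in replies if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, the proposer must adopt the
    # one with the highest proposal number instead of its own value.
    previous = [p for p in granted if p is not None]
    if previous:
        value = max(previous, key=lambda p: p[0])[1]

    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="blue")
second = propose(acceptors, n=2, value="green")
print(first, second)  # blue blue
```

The second proposer wanted "green" but ends up re-proposing "blue", because a majority had already accepted it. That is the sense in which a chosen value remains.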

Basic Paxos has practical limits, though. It guarantees safety but does not guarantee liveness without coordination: competing proposers can keep preempting one another and prevent progress. Even then, the process outlined only reaches agreement on a single value.
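To make the liveness problem concrete, here is a small sketch of two proposers "dueling". It assumes a deliberately adversarial schedule in which each proposer finishes its prepare phase just before the rival's accept requests arrive; the `Acceptor` class and `duel` function are hypothetical names for this illustration only.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised
        self.accepted = None  # (number, value) once something is accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def duel(steps):
    """Alternate proposers A and B so each preempts the other's accept phase."""
    acceptors = [Acceptor() for _ in range(3)]
    pending = None  # (number, value) of the proposer still waiting on phase 2
    votes = 0
    for step in range(steps):
        n = step + 1
        proposer = "A" if step % 2 == 0 else "B"
        # The current proposer completes phase 1 with a fresh, higher number...
        for acc in acceptors:
            acc.prepare(n)
        # ...so the rival's outstanding accept requests are now rejected.
        if pending is not None:
            votes += sum(acc.accept(*pending) for acc in acceptors)
        pending = (n, proposer)
    return votes, acceptors


votes, acceptors = duel(steps=10)
print(votes)                                             # 0
print(any(a.accepted is not None for a in acceptors))    # False
```

Nothing inconsistent ever happens (safety holds), but after ten rounds no value has been accepted by anyone. Real deployments avoid this schedule with techniques such as randomized backoff or a single distinguished proposer.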

Real systems solve this by running consensus repeatedly to build an ordered log of values. Each value represents a command or transaction, and every node applies the log in the same order to derive identical system state. This replicated state machine model underpins many modern distributed databases and coordination systems.
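A toy version of the replicated state machine idea, assuming the log has already been agreed on: each replica applies the same commands in the same order and ends up with the same state. The command format and the `apply_log` helper are invented for this sketch.

```python
def apply_log(log):
    """Apply an ordered list of commands to an empty key-value state."""
    state = {}
    for op, key, amount in log:
        if op == "set":
            state[key] = amount
        elif op == "add":
            state[key] = state.get(key, 0) + amount
        else:
            raise ValueError(f"unknown command: {op}")
    return state


# Each entry stands for one value chosen by a consensus round.
log = [("set", "balance", 100), ("add", "balance", -30), ("add", "balance", 5)]

# Three replicas applying the same log in the same order agree on the state.
replicas = [apply_log(log) for _ in range(3)]
print(replicas[0])  # {'balance': 75}

# The same commands in a different order produce a different state.
reordered = [log[1], log[0], log[2]]
print(apply_log(reordered))  # {'balance': 105}
```

The second result is why ordering matters: agreeing on the set of commands is not enough, the nodes must also agree on their sequence.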

Imagine a high-throughput system with millions of transactions per minute. Consensus coordination can dominate system latency.

Notice something. Consensus requires coordination. Coordination without a leader requires negotiation among many peers.

In practice, distributed consensus typically relies on elected-leader models that manage transaction ordering (e.g., Raft, ZAB, Multi-Paxos). The elected leader holds a lease and retains that lease through heartbeats. In the event of a partition, followers cut off from the leader stop receiving heartbeats and trigger a new election.
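The lease-and-heartbeat mechanism can be sketched from a follower's point of view. This is a simplified illustration rather than Raft or ZAB as specified: the `Follower` class, its method names, and the `lease_seconds` parameter are assumptions made for the example, timestamps are passed in explicitly to keep it deterministic, and vote collection is left out entirely.

```python
class Follower:
    """Tracks a leader's lease using timestamps supplied by the caller."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.leader = None
        self.term = 0
        self.lease_expires = 0.0

    def on_heartbeat(self, leader, term, now):
        """A heartbeat from the current or a newer term renews the lease."""
        if term < self.term:
            return False  # stale leader from an older term is ignored
        self.leader, self.term = leader, term
        self.lease_expires = now + self.lease_seconds
        return True

    def should_start_election(self, now):
        """No heartbeat within the lease window: time to elect a new leader."""
        return now >= self.lease_expires


f = Follower(lease_seconds=2.0)
f.on_heartbeat("node-1", term=1, now=0.0)
print(f.should_start_election(now=1.5))  # False: lease still valid
f.on_heartbeat("node-1", term=1, now=1.5)
print(f.should_start_election(now=3.0))  # False: renewed at 1.5, expires at 3.5
print(f.should_start_election(now=4.0))  # True: heartbeats stopped

# After an election, a new leader with a higher term takes over; late
# heartbeats from the old leader no longer renew anything.
print(f.on_heartbeat("node-2", term=2, now=4.2))  # True
print(f.on_heartbeat("node-1", term=1, now=4.3))  # False
print(f.leader)                                   # node-2
```

The leader never owns its position outright. It keeps it only as long as it keeps proving it is alive and reachable.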

Distributed systems mirror human systems.

Notice what we just touched on:

  • Consensus without a leader (i.e. arbiter) hurts performance and can cause system deadlock
  • Leaders increase performance by reducing consensus work: machines have less to negotiate to reach agreement
  • The elected-leader model prevents the leader node from deadlocking the system: a leader that fails or becomes unreachable is replaced through a new election

Shortly after the wave of “death to managers” articles came a series of articles noting that the utopian vision so triumphantly declared had problems. Everything was great until … what if someone disagreed on the event architecture? What if one engineer was convinced a CQRS event model was ideally suited for the problem while another was convinced a stateful model was necessary to manage performance constraints? Both have valid points. Neither budges. Who decides? Here we have deadlock in a human system trying to reach consensus.

Leaders help prevent deadlock and improve performance by making a decision.

Likewise, what about leaders who go absent, fail to be decisive, or lack conviction? A new leader (i.e. manager) is elected. Leadership is leased, not owned.

Our methods of solving computational problems sometimes mirror our methods of solving team problems. Effective leadership boosts operational efficiency; leadership itself is leased, not owned.
