Thinking & Building Distributed First

When many of us are first taught to program, we’re told something along the lines of “programming is a set of very specific instructions that a computer follows in a specific order.” And this is true, sort of. It’s true, but it’s incomplete. What that statement ignores is that when we’re programming for the web, we’re usually dealing with more than one computer.

Multiple computers run our application servers. We use microservices. We have databases and background tasks and message buses and scheduled jobs.

These are the high-level things I wish someone had told me when I first graduated from my bootcamp. They’re questions to think about as we’re writing code within a system. I don’t have answers to all of these questions here - they’re highly context-dependent. But it’s important to consider these things early on in the development process - otherwise we might end up with a system that works delightfully on our local machines, but not at all in production.

The target audience here is folks who have at least some understanding of how web applications work and how to build them, but haven’t spent an extensive amount of time working on highly-available distributed systems in production.

If there are multiple side-effects of a given action, does the order in which they occur matter? Order is not guaranteed in many distributed systems:

  • If you have multiple servers processing background tasks, the tasks may be processed in a different order than they were enqueued.
  • Similarly, in partitioned pub-sub systems, messages may be consumed in a different order than they were produced, or consumed by multiple different consumers in a different order each time.
  • If you make multiple asynchronous requests, they might finish in a different order than they were initiated (see the sketch below).
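As a quick illustration of that last point, here’s a minimal sketch using Python’s asyncio. Completion order routinely differs from the order the requests were started in:

```python
import asyncio
import random

async def fake_request(i: int) -> int:
    # Simulate variable network latency.
    await asyncio.sleep(random.random())
    return i

async def main() -> None:
    tasks = [asyncio.create_task(fake_request(i)) for i in range(5)]
    # as_completed yields results in completion order, which is
    # frequently not the order the requests were initiated in.
    for finished in asyncio.as_completed(tasks):
        print(await finished)

asyncio.run(main())
```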

If order does matter, you’ll need to make sure it’s preserved: put all related actions in a single background task, use partition keys to make sure related messages end up on the same partition of your message bus, make your requests synchronous, or use another mechanism appropriate to your system.
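Here’s a sketch of two of those mechanisms, assuming Celery for background tasks and kafka-python for the message bus. The helper functions are hypothetical stand-ins for your own logic:

```python
from celery import Celery
from kafka import KafkaProducer

app = Celery("tasks", broker="redis://localhost:6379/0")

# Option 1: keep ordered side effects inside a single background task,
# so another worker can't interleave work between the steps.
@app.task
def handle_signup(user_id):
    create_account_record(user_id)   # hypothetical helpers, run in order
    send_welcome_email(user_id)

# Option 2: key messages by entity, so every event for a given user
# lands on the same partition and per-user ordering is preserved.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def publish_user_event(user_id, payload: bytes):
    producer.send("user-events", key=str(user_id).encode(), value=payload)
```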

When you have a distributed system, the pieces of the system typically talk to each other via network requests. The most common format for data in a network request is JSON. The data types that are JSON serializable are numbers, strings, booleans, lists/arrays, dictionaries/hashes, and null. That’s it.

There are plenty of other data types that are commonly used in our code, and we often pass them around as arguments to functions. This won’t work if you’re sending the data off to another part of the system. I wrote previously about this problem, specifically in the context of background tasks, here.
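A quick illustration in Python with the standard library’s json module: datetimes and Decimals blow up at the serialization boundary unless you convert them yourself.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

payload = {"created_at": datetime.now(timezone.utc), "amount": Decimal("19.99")}

try:
    json.dumps(payload)
except TypeError as exc:
    # "Object of type datetime is not JSON serializable"
    print(exc)

# Convert to JSON-safe types at the boundary, and parse them
# back into rich types on the receiving side.
safe = {
    "created_at": payload["created_at"].isoformat(),
    "amount": str(payload["amount"]),
}
print(json.dumps(safe))
```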

When deploying changes in a distributed system, the deployment typically either touches only one part of the system at a time, or isn’t guaranteed to finish everywhere at the exact same moment. This means that the APIs for our microservices need to work on both sides of the deployment. It means that when we make schema changes to our database, we need a mechanism for the code to work with the database in either the “before” or the “after” state. It means that if the signature of a background task is changed, it still needs to be able to process messages that were enqueued before the deployment.
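For the background-task case, one common pattern (sketched here with a hypothetical Celery task) is to make new arguments optional with sensible defaults, so workers running the new code can still process messages enqueued by the old code:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

# Old signature: send_receipt(order_id)
# The new argument is optional, so messages enqueued before the deploy
# (which lack "locale") still deserialize and run under the new code.
@app.task
def send_receipt(order_id, locale=None):
    locale = locale or "en"
    render_and_send(order_id, locale)  # hypothetical helper
```

The same idea applies to schema changes: add the nullable column first, deploy code that handles both states, then backfill and tighten constraints.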

Most of the time, distributed systems run with low latency and high reliability. But eventually, something will go wrong, and it’s important to know what will happen when it does. Some of this goes back to the ordering question. If latency is high in one part of the system, other parts may chug on ahead as usual while that one falls behind. But there are other things to consider as well. Will your customers notice? If they do, is that okay? What other downstream systems will be impacted? Will data be lost?

And sometimes just as importantly - what will happen when your system comes back up? Is the process automated? Will there be a huge spike in load as a backlog of data is processed?
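One small mitigation worth sketching: retry failed calls with exponential backoff plus jitter, so that when a dependency recovers, your whole fleet doesn’t hammer it in lockstep. TransientError here is a stand-in for whatever retryable exception your client library raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure from a downstream dependency."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff plus random jitter spreads retries out,
            # softening the load spike when the dependency comes back up.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```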

Tracing and observability are more challenging in distributed systems. If there are multiple pieces of a system involved in making something happen, and that thing doesn’t happen, are you going to know why? Which part of the system failed? Did the record fail to get written to the database? Was there a server error on one of the application servers? Are the servers processing background tasks down? Is a third-party service provider down? And so on.

Consider what logging, alerting, and observability tooling you need to be able to track down unexpected results.
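One concrete starting point is to attach a correlation ID to each request and pass it along to every downstream call and log line, so you can stitch together what happened across services. A minimal sketch, where the enqueue helper is hypothetical:

```python
import logging
import uuid

logger = logging.getLogger("orders")

def handle_request(payload, trace_id=None):
    # Generate an ID at the edge; every downstream service reuses it,
    # so logs from all parts of the system can be joined on trace_id.
    trace_id = trace_id or str(uuid.uuid4())
    logger.info("order received", extra={"trace_id": trace_id})
    enqueue_background_task(payload, trace_id=trace_id)  # hypothetical helper
```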