Distributed Systems: an introduction to Publish-Subscribe (pub/sub)

More and more of the web is moving to microservice architecture, which allows for loosely-coupled services to work together to provide functionality to users. While these services will often communicate with each other via network requests, the publish-subscribe model is another common way for this collaboration to occur.

There are many different technologies that can be used to implement pub-sub, including but not limited to Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Microsoft Event Hubs. The purpose of this post is not to go into the details of implementation or discuss the pros and cons of these specific technologies, but to define the common terminology and give an overview of how the pieces fit together.

I talk about two types of events that can be passed through a pub-sub model here: Change Data Capture events and “business events.” These messages could of course be anything, though!

Before we dive into pub-sub, let’s take a quick detour to decoupling. If we’re not convinced of the value of decoupling, pub-sub won’t do us any good!

Decoupling means reducing the dependency between pieces of a system. This can mean two classes in the same codebase or two services in a system built on microservices, or anything in between.

OK, but what do we mean by “reducing the dependency”? Obviously the different parts of the system rely on each other - they’re part of the same system! Typically what people mean when they talk about the level of dependency between parts of a system is whether one part needs to be changed when another does, or whether each can be updated independently.

Pub-sub is decoupled because using it, we can change anything we want about a publisher - we could even go so far as to rebuild it using an entirely different language or framework! - but as long as it continues to produce messages in the same way, we don’t need to change anything at all about the downstream subscriber. Same goes in reverse - as long as a subscriber continues to be able to process messages that come to it in the same format, we can change anything we want about it and the publisher doesn’t need to be aware of this change at all.

This will make more sense once we dive into this a bit more, so let’s do that!

Diagram: two publishers producing data to a single message bus, with three subscribers consuming data from the message bus

Below are definitions of some of the terminology commonly used in pub-sub offerings, which typically provide APIs for doing all of this work, along with an explanation of how the pieces work together.

Each message bus (the servers that messages are sent to, stored on, and pulled from) can have one or more topics on it. Topics are typically used to separate different types of messages from each other, since not all subscribers will need to see all messages - they usually only need to operate on a pre-defined subset of messages. If all subscribers saw every message, they would waste a lot of time determining whether they needed to do anything with a given message, and would end up discarding a huge number without taking any action - this would be a waste of resources!

Publishing is also sometimes referred to as “producing.” A publisher is anything that puts messages onto the message bus. This can be a database trigger, a web request, an API call, a script or cron job, whatever. Messages typically take a pre-defined format and are often encoded as JSON. They are pushed to a specific topic.

Subscribing is also sometimes referred to as “consuming.” Subscribers are typically responsible for a specific piece of behavior, and will subscribe to a specific topic. The beauty of pub-sub is that you can have as many subscribers as you want and scale each one separately, depending on how long each message takes a given subscriber to process and how much traffic is on the topic to which it subscribes.
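To make topics, publishers, and subscribers a little more concrete, here is a minimal in-memory sketch. The `MessageBus` class, the topic names, and the message fields are all made up for illustration, and the bus simply calls each subscriber's handler directly. A real offering would store messages on its servers and let subscribers pull them over the network instead.

```python
import json
from collections import defaultdict


class MessageBus:
    """Toy in-memory bus (hypothetical): each topic has its own subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # A subscriber registers interest in exactly one topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Serialize to JSON, as a publisher would before sending over the wire.
        payload = json.dumps(message)
        # Only subscribers of this topic ever see the message.
        for handler in self._subscribers[topic]:
            handler(json.loads(payload))


bus = MessageBus()
emails_sent = []
audit_log = []

# Two independent subscribers on the same topic, one on a different topic.
bus.subscribe("user.signed_up", lambda msg: emails_sent.append(msg["email"]))
bus.subscribe("user.signed_up", lambda msg: audit_log.append(msg["user_id"]))
bus.subscribe("order.placed", lambda msg: audit_log.append(msg["order_id"]))

bus.publish("user.signed_up", {"user_id": 1, "email": "a@example.com"})
```

Both `user.signed_up` subscribers receive the message, while the `order.placed` subscriber never has to look at it. The publisher knows nothing about who is listening.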

Topics can be further broken down into partitions. How you partition your data can be important, as ordering is not guaranteed across partitions. If the order in which messages are processed is important, all messages that need to be processed in order will need to be on the same partition. For example, you may partition messages based on a user_id if all messages for a given user need to be processed in order, but order doesn’t matter across different users.
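One common way to keep a given user's messages on the same partition is to hash the key and take the result modulo the number of partitions. The sketch below assumes that approach. `partition_for`, the partition count of 4, and the sample events are all hypothetical, and real systems each use their own hashing scheme.

```python
import zlib

NUM_PARTITIONS = 4  # assumed value, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = [[] for _ in range(NUM_PARTITIONS)]

events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-1", "deleted"),
]

for user_id, action in events:
    # The same user_id always maps to the same partition, so that user's
    # events are appended in the order they were published.
    partitions[partition_for(user_id)].append((user_id, action))

user_1_actions = [
    action
    for uid, action in partitions[partition_for("user-1")]
    if uid == "user-1"
]
```

Here `user_1_actions` comes back in publish order, because every `user-1` event landed on one partition. Nothing is promised about how `user-2` events interleave with them.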

The number of partitions you choose can be an important early decision, since it’s difficult to change later (it could cause downtime and the possibility of losing ordering). You also cannot usefully have more workers on a given subscriber than you have partitions, since the messages on a single partition are never divided between workers (that would result in losing ordering).
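To see why the partition count caps the number of workers that can do anything useful, here is a rough sketch of handing out partitions to the workers of one subscriber. The round-robin strategy and the `assign_partitions` helper are assumptions for illustration. Real systems have their own assignment strategies.

```python
def assign_partitions(num_partitions, workers):
    """Give each partition to exactly one worker, round-robin (hypothetical)."""
    assignment = {worker: [] for worker in workers}
    for partition in range(num_partitions):
        # A partition is never split between workers, which preserves ordering.
        worker = workers[partition % len(workers)]
        assignment[worker].append(partition)
    return assignment


two_workers = assign_partitions(3, ["w1", "w2"])
four_workers = assign_partitions(3, ["w1", "w2", "w3", "w4"])
```

With three partitions and two workers, `w1` handles partitions 0 and 2 and `w2` handles partition 1. With four workers, `w4` ends up with an empty list, because there is no partition left to give it.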

You might be looking at that diagram above and thinking that it looks an awful lot like diagrams you’ve seen of background task messaging systems. And you’re right - they do have some similarities! The primary difference between background task queues and pub-sub systems is the lifecycle of a given message or event.

With background tasks, messages can be pushed to the queue from any number of places, just like with pub-sub. And a background task queue can have any number of workers processing messages from that queue, just like with pub-sub. However, with background queues, once a message has been processed by any one of the workers, that message is removed from the queue and not processed any further.

With pub-sub, each subscriber group (cluster of workers) is responsible for a discrete piece of logic, so each group must process each message. Pub-sub message buses also often retain data for some period of time, instead of discarding messages from the queue as soon as they’re processed. This allows events to be replayed, should something have gone wrong the first time a subscriber processed them, or even if a new subscriber needs to be added after the fact!
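The difference in message lifecycle can be sketched with two toy classes. Both `TaskQueue` and `RetainedTopic` are hypothetical, in-memory stand-ins, and the per-group offset bookkeeping is one assumed way of tracking what each subscriber group has already processed.

```python
from collections import deque


class TaskQueue:
    """Background-task style: a message is gone once any worker takes it."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        self._tasks.append(task)

    def pop(self):
        return self._tasks.popleft() if self._tasks else None


class RetainedTopic:
    """Pub-sub style: messages are kept, and each group tracks its own position."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # subscriber group -> index of next unread message

    def publish(self, message):
        self._log.append(message)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        messages = self._log[offset:]
        self._offsets[group] = len(self._log)
        return messages

    def rewind(self, group, offset=0):
        # Replaying is just moving a group's position backwards.
        self._offsets[group] = offset


queue = TaskQueue()
queue.push("resize-image")
first_worker = queue.pop()   # gets the task
second_worker = queue.pop()  # nothing left to process

topic = RetainedTopic()
topic.publish("order-1")
topic.publish("order-2")
billing = topic.poll("billing")    # sees both messages
shipping = topic.poll("shipping")  # also sees both messages

topic.rewind("billing")
replayed = topic.poll("billing")      # billing reprocesses from the start
late_joiner = topic.poll("analytics")  # a new group still gets the history
```

This sketch keeps every message forever, which a real bus would not do. Retention is usually limited to a configured period of time.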

This was a high-level overview, but hopefully it gave you enough context to keep diving in from here - pub-sub is a powerful software pattern! As with all distributed systems, things are often a lot more complex than they seem at first glance. These systems need to be fault-tolerant, they need to handle high throughput with low latency, and they need to be configurable. While most services with pub-sub offerings will provide defaults that will get you up and running, and you can use them without knowing much about the internals, diving deeper can always provide more context and help you make more informed decisions about how to use these services.