Change Data Capture (CDC) vs. Business Event Pipelines

As software relies more on large volumes of data and less on the request-response cycle alone, often in favor of event-driven architectures, it becomes more important to choose the right way to track changes and user actions in your system and to present users with the data and analytics they want.

Both CDC and business event pipelines can be built with service-oriented architectures in mind. Both can use a pub-sub model, allowing a variety of consumers to read the data they produce, but the same user action will produce data in a different format depending on which you choose.

This is not a deep dive into how to implement either (I plan to write more in-depth about both in the future!), but rather a high-level look at what the two are and how they differ, to help you understand why a system might implement both, or to help you choose which is more relevant for your needs.

CDC captures changes made in a database at the row level and replicates them to another location, often another database or data store. This can mean one of two things. Either every change is tracked, so that the full trail of how a row reached its current state is available, or a separate data store simply needs to stay up to date with the primary one, so changes are replicated there but intermediate states are not kept.

CDC pipelines can be useful for data replication, such as to a data warehouse, or for ETL (Extract, Transform, Load) jobs.

Because CDC events represent changes to rows in the database, they are typically generated by the database itself. The two most common ways are:

  • Read from the write-ahead log (WAL). This allows changes to be captured asynchronously while still providing low latency.

  • Use database triggers to write changes to a dedicated table in the database built for storing them. This also provides very low latency, but the trigger runs synchronously as part of each write, which may lead to performance issues.
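
To make the trigger-based approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and trigger names (accounts, account_changes, accounts_capture_update) are made up for illustration, and a production database would have its own trigger syntax, so treat this as a sketch of the idea rather than a recommended setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);

    -- Hypothetical change table that a CDC consumer would read from.
    CREATE TABLE account_changes (
        change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id  TEXT NOT NULL,
        old_balance INTEGER,
        new_balance INTEGER
    );

    -- The trigger runs synchronously, as part of the UPDATE itself.
    CREATE TRIGGER accounts_capture_update
    AFTER UPDATE OF balance ON accounts
    BEGIN
        INSERT INTO account_changes (account_id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
""")

conn.execute("INSERT INTO accounts VALUES ('A', 500)")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'A'")
conn.commit()

# The application never wrote to account_changes; the trigger did.
changes = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM account_changes"
).fetchall()
print(changes)  # [('A', 500, 400)]
```

Because the trigger's INSERT is part of the same write, every update to accounts now pays for an extra row in account_changes, which is where the performance cost comes from.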

A business event represents a piece of business logic that occurs within a system. We’ll dig in with an example shortly, but the changes produced by a business event are often not represented by a single row in the database. This means that changes captured by a CDC pipeline cannot necessarily be repurposed as a business event pipeline.

Business event pipelines are good for auditing user behavior, or for triggering other actions within a system based on that behavior. They may also be able to aggregate user actions that touch a variety of systems or services in a way that CDC pipelines cannot (unless the services share a database).

Business events will need to be produced at the application level, since the application stores business logic. As we’ll see in our example below, the database itself cannot know which user event caused any given change, so it cannot be responsible for tracking business events.

Let’s take a bank transaction as an example: a user transferring $100 from one account to another.

If we were using CDC, we might see:

  • Account A’s balance decreased by $100
  • Account B’s balance increased by $100
  • A Transfer record created with a foreign key (FK) to each of Accounts A and B, and a value of $100
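
As a rough sketch, those three row-level changes could be modeled as records like the ones below. The RowChange shape, the column names, and the starting balances ($500 and $200) are assumptions made for illustration; real CDC tools each define their own event format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowChange:
    """One row-level change, as a CDC pipeline might report it (hypothetical shape)."""
    table: str
    op: str                  # "insert", "update", or "delete"
    before: Optional[dict]   # row state before the change (None for inserts)
    after: Optional[dict]    # row state after the change (None for deletes)


cdc_events = [
    RowChange("accounts", "update",
              {"id": "A", "balance": 500}, {"id": "A", "balance": 400}),
    RowChange("accounts", "update",
              {"id": "B", "balance": 200}, {"id": "B", "balance": 300}),
    RowChange("transfers", "insert", None,
              {"id": 1, "from_account_id": "A", "to_account_id": "B", "amount": 100}),
]

# Each event describes one row; nothing in it names the user action behind it.
balance_deltas = [
    e.after["balance"] - e.before["balance"]
    for e in cdc_events
    if e.table == "accounts"
]
print(balance_deltas)  # [-100, 100]
```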

If we were tracking business events, we might see something like:

  • “User Z transferred $100 from Account A to Account B”
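
A minimal sketch of producing that event at the application level might look like the following. The MoneyTransferred class, the in-memory subscribers list, and the publish helper are all hypothetical stand-ins; a real system would more likely publish to a message broker.

```python
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass(frozen=True)
class MoneyTransferred:
    """Hypothetical business event: one record for the whole user action."""
    user_id: str
    from_account: str
    to_account: str
    amount: int


# In-memory stand-in for a pub-sub topic.
subscribers: list[Callable[[dict], None]] = []


def publish(event) -> None:
    payload = {"type": type(event).__name__, **asdict(event)}
    for handler in subscribers:
        handler(payload)


balances = {"A": 500, "B": 200}  # illustrative starting balances


def transfer(user_id: str, from_account: str, to_account: str, amount: int) -> None:
    # The application knows why the balances change, so it names the event.
    balances[from_account] -= amount
    balances[to_account] += amount
    publish(MoneyTransferred(user_id, from_account, to_account, amount))


received: list[dict] = []
subscribers.append(received.append)

transfer("Z", "A", "B", 100)
print(received[0])
```

Here the event is emitted by the same code path that carries out the transfer, which is the only place that knows both who acted and what they intended.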

This demonstrates two things. First, to determine whether a business event has occurred, we would need to identify multiple updated or inserted rows in the database. Second, a single change to a row could indicate any number of business events, and without some additional logic we wouldn’t necessarily be able to tell which one it was. Was that increase in an account balance due to a deposit, interest being earned, or a transfer?
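
The ambiguity can be shown with a small sketch. The three operations below (deposit, pay_interest, transfer) and the shape of the row change are hypothetical, but each one credits Account B with $100 and, looked at from Account B's row alone, produces exactly the same change.

```python
def credit(balances: dict, account: str, amount: int) -> dict:
    """Apply a balance change and return the row-level change a CDC pipeline would see."""
    before = balances[account]
    balances[account] = before + amount
    return {"table": "accounts", "op": "update", "id": account,
            "before": before, "after": balances[account]}


# Three hypothetical business operations that all touch Account B.
def deposit(balances, account, amount):
    return [credit(balances, account, amount)]


def pay_interest(balances, account, amount):
    return [credit(balances, account, amount)]


def transfer(balances, source, target, amount):
    return [credit(balances, source, -amount), credit(balances, target, amount)]


def change_for(changes, account):
    return next(c for c in changes if c["id"] == account)


seen_for_b = [
    change_for(deposit({"B": 200}, "B", 100), "B"),
    change_for(pay_interest({"B": 200}, "B", 100), "B"),
    change_for(transfer({"A": 500, "B": 200}, "A", "B", 100), "B"),
]

# All three row changes are identical, so the row alone cannot name the cause.
indistinguishable = all(c == seen_for_b[0] for c in seen_for_b)
print(indistinguishable)  # True
```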