Today I took a deep dive into CDC: how it works, how it captures changes in a database, what its use cases are, and how DiceDB's reactivity differs from it.
What is CDC?
Change Data Capture (CDC) is a pattern for detecting changes (inserts, updates, and deletes) in a source database and delivering them to downstream systems. In today's data-driven world, organizations need real-time insights to make quick decisions and stay competitive. Two approaches have emerged to handle data changes: CDC and reactive databases like DiceDB. While both address the challenge of keeping systems synchronized with data changes, they employ fundamentally different architectures. This post explores both approaches, their mechanisms, and their respective use cases.
How does CDC work, and what are its types?
CDC tracks changes to a user-specified source (for example, a table or a whole database) and emits an event whenever that source is modified. There are three main types:
Log Based CDC
Log-based CDC is generally considered the most efficient implementation. It continuously monitors the source database's transaction log (such as Postgres' WAL or MySQL's binary log) for new entries, so it captures changes with minimal impact on the source system's performance. More on WAL here.
Advantages:
Minimal impact on source database performance
Provides complete change history
Disadvantages:
Complex to implement because each database has its own log format
Requires parsing of internal database structures
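To make the idea concrete, here is a minimal sketch of the log-tailing loop. It is not how any real connector parses Postgres' WAL or MySQL's binary log: the `LogEntry` and `LogTailer` names are hypothetical, and the "log" is just an in-memory list standing in for an append-only transaction log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogEntry:
    lsn: int    # log sequence number, strictly increasing
    op: str     # "INSERT", "UPDATE" or "DELETE"
    table: str
    row: dict


class LogTailer:
    """Hypothetical reader that returns entries past the last LSN it has seen."""

    def __init__(self, log: List[LogEntry]) -> None:
        self._log = log
        self.last_lsn = 0

    def poll(self) -> List[LogEntry]:
        # Only entries newer than the remembered position are emitted.
        fresh = [entry for entry in self._log if entry.lsn > self.last_lsn]
        if fresh:
            self.last_lsn = fresh[-1].lsn
        return fresh


# The list stands in for the database's append-only transaction log.
log: List[LogEntry] = []
tailer = LogTailer(log)

log.append(LogEntry(1, "INSERT", "users", {"id": 1, "name": "Ada"}))
log.append(LogEntry(2, "UPDATE", "users", {"id": 1, "name": "Ada L."}))
first_batch = [entry.op for entry in tailer.poll()]

log.append(LogEntry(3, "DELETE", "users", {"id": 1}))
second_batch = [entry.op for entry in tailer.poll()]

print(first_batch, second_batch)
```

The tailer never touches the tables themselves; it only remembers how far into the log it has read, which is why this style of CDC puts so little load on the source.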
Trigger Based CDC
As the name suggests, this type of CDC uses database triggers to track row changes, writing a change log into shadow tables. Triggers are set to fire before or after INSERT, UPDATE, or DELETE operations.
Advantages:
Real-time change capture
Disadvantages:
Significant performance impact on source database
Requires trigger maintenance as applications evolve
Can strain system resources
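A small sketch of the trigger approach, using SQLite through Python's built-in `sqlite3` module so it runs anywhere. The `users` table, the `users_changes` shadow table, and the trigger names are made up for illustration; a production setup would use the source database's own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);

    -- Shadow table that accumulates one row per captured change.
    CREATE TABLE users_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op        TEXT NOT NULL,
        user_id   INTEGER NOT NULL,
        name      TEXT
    );

    CREATE TRIGGER users_after_insert AFTER INSERT ON users BEGIN
        INSERT INTO users_changes (op, user_id, name)
        VALUES ('INSERT', NEW.id, NEW.name);
    END;

    CREATE TRIGGER users_after_update AFTER UPDATE ON users BEGIN
        INSERT INTO users_changes (op, user_id, name)
        VALUES ('UPDATE', NEW.id, NEW.name);
    END;

    CREATE TRIGGER users_after_delete AFTER DELETE ON users BEGIN
        INSERT INTO users_changes (op, user_id, name)
        VALUES ('DELETE', OLD.id, OLD.name);
    END;
    """
)

# Ordinary writes; the triggers record each one in the shadow table.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE users SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")

changes = conn.execute(
    "SELECT op, user_id, name FROM users_changes ORDER BY change_id"
).fetchall()
print(changes)
```

Every write to `users` now performs a second write to `users_changes` inside the same transaction, which is where the performance cost listed above comes from.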
Timestamp Based CDC
This approach polls the database and compares each row's last-updated timestamp with the timestamp stored from the previous poll. Any row whose timestamp is newer than the stored one is pushed out as an event, and the stored timestamp is then advanced.
Advantages:
Simpler to implement
Disadvantages:
Not real-time: an event can be delayed by up to one polling interval
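The polling loop can be sketched in a few lines. This assumes a hypothetical `orders` table with an `updated_at` column that the application keeps current; integer timestamps are used so the example is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)


def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Advance the watermark only if something new was found.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


conn.execute("INSERT INTO orders VALUES (1, 'created', 100)")
conn.execute("INSERT INTO orders VALUES (2, 'created', 105)")
events_1, watermark = poll_changes(conn, 0)

conn.execute("UPDATE orders SET status = 'paid', updated_at = 110 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")  # leaves no row to poll
events_2, watermark = poll_changes(conn, watermark)

print(events_1)
print(events_2)
```

Note that the second poll only reports the update. The deleted row is simply gone, so a plain timestamp poll never sees it unless the application uses soft deletes.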
Use Cases
Now that we have covered what CDC is and its types, you might wonder: why not emit events directly to Kafka whenever data changes and skip the CDC pipeline altogether? In many cases that works, especially when your application logic governs the data flow. However, CDC shines in scenarios where data modifications occur outside your application layer or when you need to replicate existing systems. Here are some concrete use cases:
Database Synchronization and Replication: CDC enables continuous replication between databases, ensuring target systems stay synchronized with source systems without full data refreshes.
Data Warehouse Synchronization: CDC keeps analytical systems synchronized with operational databases.
Real-Time Analytics: Organizations use CDC to feed data warehouses and data lakes with up-to-date information for real-time business intelligence.
Cloud Migration: CDC facilitates zero-downtime migrations by keeping source and target systems synchronized during transition periods. Airbyte, for example, supports migrating from MySQL to Postgres this way without much effort.
Event-Driven Architectures: CDC provides the foundation for event-driven systems by capturing and propagating data changes as events.
Cache Invalidation: Uber uses a system called Flux CDC, which captures changes made to the MySQL database and replicates them to Redis. When a change occurs, the corresponding cache entry is invalidated to ensure data consistency.
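The cache invalidation case is easy to picture with a toy example. This is only a sketch of the general pattern, not Uber's implementation: plain dictionaries stand in for the database and the cache, and `on_change_event` is a hypothetical handler that a CDC pipeline would call for each captured change.

```python
# Stand-ins for the source database and the cache in front of it.
database = {"user:1": "Ada"}
cache = {}


def read_through(key):
    """Serve from the cache, falling back to the database on a miss."""
    if key not in cache:
        cache[key] = database[key]
    return cache[key]


def on_change_event(event):
    """Hypothetical CDC consumer: drop the cached entry for the changed key."""
    cache.pop(event["key"], None)


before = read_through("user:1")   # miss, then cached

database["user:1"] = "Ada L."     # a write that bypasses the cache
stale = read_through("user:1")    # still the old cached value

on_change_event({"op": "UPDATE", "key": "user:1"})
after = read_through("user:1")    # miss again, reloaded from the database

print(before, stale, after)
```

The value read between the write and the invalidation is stale, which is exactly the window a CDC-driven invalidation pipeline tries to keep short.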
Reactivity:
This is where DiceDB, created by Arpit Bhayani, comes in. By the way, I am a contributor to DiceDB (PR Link). DiceDB introduces the concept of reactivity: it is a push-based database that pushes updates to any connected client that has subscribed using the WATCH command. Unlike conventional databases, where clients must query for data, DiceDB proactively pushes updated query results to clients as soon as the underlying data changes.
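The push model can be illustrated with a tiny in-memory store. To be clear, this is not DiceDB's API or wire protocol: `ReactiveStore` and its `watch` and `set` methods are hypothetical names used only to show subscribers being called on every change instead of polling.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, str], None]


class ReactiveStore:
    """Toy key-value store that pushes new values to watchers."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callback]] = {}

    def watch(self, key: str, callback: Callback) -> None:
        # Register interest once; no further queries are needed.
        self._watchers.setdefault(key, []).append(callback)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value
        # Push the new value to every subscriber of this key.
        for callback in self._watchers.get(key, []):
            callback(key, value)


store = ReactiveStore()
received = []
store.watch("score", lambda key, value: received.append((key, value)))

store.set("score", "10")
store.set("other", "x")    # nobody watches this key, so nothing is pushed
store.set("score", "11")

print(received)
```

The client code never asks "has it changed yet?"; the store calls it back, which is the core difference from the query-and-poll style of a conventional database.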
CDCs vs Reactive DBs:
While both CDC and DiceDB address data change management, they employ fundamentally different approaches:
Architecture Philosophy
CDC: Operates as a middleware layer between source and target systems, focusing on data replication and synchronization. CDC captures changes and delivers them to downstream systems, maintaining separation between data producers and consumers.
DiceDB: Functions as a reactive database that eliminates the distinction between data storage and change notification. It integrates reactivity directly into the database engine, providing immediate result set updates to subscribers.
Change Notification Approach
CDC: Provides change events or deltas, notifying subscribers about what changed but requiring them to process these changes to understand current state.
DiceDB: Delivers complete, updated result sets rather than just change notifications. When subscribed data changes, clients receive the new query results immediately, not just information about what changed.
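The difference between the two notification styles can be sketched as follows. The event shapes here are assumptions made for illustration and do not match any specific CDC tool's or DiceDB's payload format.

```python
def apply_delta(state, event):
    """Fold one change event into the consumer's copy of the state."""
    new_state = dict(state)
    if event["op"] == "DELETE":
        new_state.pop(event["key"], None)
    else:  # INSERT and UPDATE both set the latest value
        new_state[event["key"]] = event["value"]
    return new_state


# CDC style: the consumer receives deltas and must rebuild current state.
deltas = [
    {"op": "INSERT", "key": "a", "value": 1},
    {"op": "UPDATE", "key": "a", "value": 2},
    {"op": "INSERT", "key": "b", "value": 5},
    {"op": "DELETE", "key": "b"},
]
cdc_view = {}
for event in deltas:
    cdc_view = apply_delta(cdc_view, event)

# Reactive style: each push already carries the full, current result set,
# so the consumer just keeps the latest one.
pushed_results = [{"a": 1}, {"a": 2}, {"a": 2, "b": 5}, {"a": 2}]
reactive_view = pushed_results[-1]

print(cdc_view, reactive_view)
```

Both consumers end up with the same view, but the CDC consumer has to own the logic in `apply_delta`, while the reactive consumer only replaces its view with whatever was pushed last.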
Implementation Complexity
CDC: Requires additional infrastructure for change capture, processing, and delivery. Organizations must implement and maintain CDC pipelines, often involving multiple tools and components.
DiceDB: Provides built-in reactivity through simple command subscriptions. The complexity is handled internally by the database engine, simplifying application development.
Data Processing
CDC: Follows a "capture-and-forward" model where changes are detected, captured, and then sent to interested parties.
DiceDB: Implements a "subscribe-and-receive" model where clients express interest in specific queries and automatically receive updated results.
Future Plans:
I plan to read through an open-source CDC tool's codebase and then build a scrappy CDC of my own, most likely log-based, that works on top of SQL databases.