If you work with data day in and day out, you have probably come across the term “Change Data Capture” (CDC). The concept becomes especially important as the amount of data you need to process grows. If your company is like many others, your data is growing rapidly every day, and as it grows, techniques like CDC become ever more crucial for handling your data inflow. Here at FlyData, we handle terabytes of data through CDC every day; it is a core part of our FlyData suite. In this article, let’s walk through what CDC is, why it matters, and what to watch out for when designing and implementing your own CDC solution.
The Need for BI Databases
First of all, why would you need something like CDC in the first place? Let’s say you launched your application and your active users are growing. Being a competent company, you’re collecting data that gives actionable insight into how your service is doing, and analyzing it to find out how you can grow your business further. The problem is that such analysis often involves long-running, taxing analytic queries, and you never want to run those against your master database, because it would ruin your application’s performance. Hence you need a dedicated BI database that has all of your application data ready for your analytics queries. Now you face the challenge of making sure your BI database always has all of your data: you need some mechanism to replicate data from your application database(s) to your BI database. How can we do this?
Option #1: Dumping Data
Probably the easiest solution (and many companies do so) is to export the entire database from the master database and import it to your BI database, say, every 24 hours. This works fine while your data size is small. However, it starts hitting limits pretty quickly as your user base grows. You’ll reach a point where you can’t export and import the entire database quickly enough for your business needs. In fact, a lot of companies regard 24 hours to be too long to wait in this fast-moving “Big Data” age.
Option #2: Utilizing CDC
This is where CDC comes to the rescue. Change Data Capture (CDC), as its name suggests, is a design pattern that captures individual data changes instead of dealing with the data in its entirety. Rather than dumping your entire database, with CDC you capture just the changes made to the master database and apply them to the BI database to keep the two in sync. This is far more scalable because it deals only with data changes, and replication can happen much faster, often in near real-time.
How CDC-based Data Integration Works
CDC-based data integration consists of the following steps:
1. Capture change data
2. Transform change data into a format your destination database supports for upload
3. Upload the data to the destination database
Let’s take a look at each of the steps.
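The three steps above can be sketched as a minimal pipeline. The function names and the in-memory “destination” below are illustrative assumptions; a real implementation would read a transaction log and call the destination database’s bulk-load API.

```python
# Minimal sketch of a CDC pipeline: capture -> transform -> upload.
# All names are illustrative; the "destination" is just a dict of tables.

def capture_changes(changelog):
    """Yield raw change events (dicts) from some change source."""
    yield from changelog

def transform(event):
    """Convert a source event into the destination's row format."""
    return {"op": event["op"].upper(), "table": event["table"], "row": event["row"]}

def upload(events, destination):
    """Apply transformed events to the destination (a dict of tables)."""
    for e in events:
        table = destination.setdefault(e["table"], {})
        key = e["row"]["id"]
        if e["op"] == "DELETE":
            table.pop(key, None)
        else:  # INSERT or UPDATE both overwrite the row by key
            table[key] = e["row"]

changelog = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "table": "users", "row": {"id": 1, "name": "Ada L."}},
]
destination = {}
upload((transform(e) for e in capture_changes(changelog)), destination)
print(destination)  # {'users': {1: {'id': 1, 'name': 'Ada L.'}}}
```

Each of the three functions corresponds to one step; the rest of the article looks at what each step involves in practice.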
Step 1. Capture Change Data
Most modern RDBMS (e.g., MySQL) offer two built-in ways of capturing change data: through (1) transaction logs, or (2) triggers. Alternatively, you can implement your own solution at the application level. Each method has its own pros and cons, so let’s cover each of them.
Transaction logs are used by RDBMS primarily for database replication. For example, when you create a read replica of your MySQL database, the master and slave databases are kept in sync via the transaction log (the “binlog” in MySQL). The transaction log contains the history of changes made to rows (INSERTs, UPDATEs, and DELETEs) as well as schema changes (DDL). The huge benefit of using transaction logs to capture change data is the minimal performance impact on the master database; it also requires no changes to your tables or your application. The challenge is that transaction log formats are proprietary to each RDBMS, and many RDBMS document them poorly because they consider the format internal. To make things worse, most, if not all, transaction log formats are binary. These characteristics make working with transaction logs quite challenging.
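To give a feel for why binary log formats are tricky, here is a toy decoder for a made-up fixed-layout binary event record. This layout is entirely invented for illustration; MySQL’s actual binlog format is far more complex and versioned.

```python
import struct

# Toy "binlog" event: 1-byte op code, 4-byte row id (big-endian),
# 2-byte payload length, then a UTF-8 payload. This layout is invented
# for illustration; real transaction log formats are proprietary.
OPS = {1: "INSERT", 2: "UPDATE", 3: "DELETE"}
HEADER = ">BIH"  # op, row_id, payload length

def encode_event(op, row_id, payload):
    data = payload.encode("utf-8")
    return struct.pack(HEADER, op, row_id, len(data)) + data

def decode_events(blob):
    offset = 0
    while offset < len(blob):
        op, row_id, length = struct.unpack_from(HEADER, blob, offset)
        offset += struct.calcsize(HEADER)
        payload = blob[offset:offset + length].decode("utf-8")
        offset += length
        yield OPS[op], row_id, payload

log = encode_event(1, 42, "name=Ada") + encode_event(2, 42, "name=Ada L.")
print(list(decode_events(log)))
# [('INSERT', 42, 'name=Ada'), ('UPDATE', 42, 'name=Ada L.')]
```

Even this simplified format requires careful byte-level bookkeeping; a real binlog reader must also handle event versioning, checksums, and per-table column metadata.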
Most RDBMS let you define triggers: functions (hooks) automatically called right before or after certain events (e.g., “after insert” or “before update”). You can define triggers to capture changes and generate a changelog of your own. A big advantage of this method is that, unlike transaction logs, all of this can be done at the SQL level. The downside is that triggers can have a significant performance impact on the master database, since they run on the application database while the data changes are being made. Also, triggers need to be defined for each table, which becomes an operational burden if you want to replicate many tables in your database.
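Because triggers must be defined per table, teams often generate the DDL programmatically rather than writing it by hand. A hedged sketch that builds MySQL-style AFTER INSERT trigger statements for a list of tables; the changelog table name, columns, and trigger naming convention here are assumptions, not a standard:

```python
# Generate MySQL-style AFTER INSERT trigger DDL for each replicated table.
# The `changelog` table and its columns are assumed for illustration;
# a real setup also needs UPDATE and DELETE triggers per table.
TRIGGER_TEMPLATE = """CREATE TRIGGER {table}_after_insert
AFTER INSERT ON {table}
FOR EACH ROW
INSERT INTO changelog (table_name, op, row_id, changed_at)
VALUES ('{table}', 'INSERT', NEW.id, NOW());"""

def build_insert_triggers(tables):
    return [TRIGGER_TEMPLATE.format(table=t) for t in tables]

ddl = build_insert_triggers(["users", "orders"])
print(ddl[0].splitlines()[0])  # CREATE TRIGGER users_after_insert
```

Generating the statements keeps the per-table definitions consistent, but the operational burden remains: every new or altered table needs its triggers regenerated and re-applied.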
Building Your Own Solution
The third option is to implement your own CDC solution at the application level. One idea is to run a script that fetches all the rows changed since the last fetch. You might use the following fields to keep track of the capture status:
- updated_at – timestamp of the last update
- revision – revision number incremented on each update
- captured – boolean flag indicating whether the row has been captured
Building your own CDC solution can be viable depending on the nature of the data changes to your tables. If the changes to your database are only INSERTs, the job is much easier to tackle. On the other hand, if you have many UPDATEs, DELETEs, or table schema changes, writing your own CDC solution becomes quite challenging, as maintaining data integrity gets significantly harder. Unlike the other two approaches, this approach requires changes to both the table schema and the application itself.
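A minimal sketch of the timestamp-based variant, using an in-memory list in place of a real table; the column names follow the fields listed above:

```python
from datetime import datetime

# In-memory stand-in for an application table with an updated_at column.
rows = [
    {"id": 1, "name": "Ada",   "updated_at": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "name": "Grace", "updated_at": datetime(2024, 1, 1, 12, 0)},
]

def fetch_changes_since(table, last_fetch):
    """Equivalent of: SELECT * FROM table WHERE updated_at > :last_fetch."""
    return [r for r in table if r["updated_at"] > last_fetch]

last_fetch = datetime(2024, 1, 1, 11, 0)
changed = fetch_changes_since(rows, last_fetch)
print([r["id"] for r in changed])  # [2]
```

Note that this query never sees DELETEd rows at all, which is one concrete reason DELETE-heavy workloads make the do-it-yourself approach so much harder.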
Step 2. Transform Change Data
Captured change data needs to be transformed into a format that can be uploaded to your destination database. The format depends on the destination: it could be a series of SQL statements, a data file you batch-upload, or a combination of the two. Depending on the destination database, you may need to make quite a few transformations to your original data. The differences between your source and destination databases can range from supported data types to character restrictions on table names, all of which you need to handle. The following are examples of differences between MySQL and Amazon Redshift, AWS’s data warehousing (DWH) solution.
- Case sensitive table/column names: MySQL – yes, Redshift – no
- TIME column value beyond 23:59:59: MySQL – yes, Redshift – no
- VARCHAR length: MySQL – character length, Redshift – byte length
- Date ‘0000-00-00’: MySQL – allowed, Redshift – not allowed
- BINARY, VARBINARY: MySQL – supported, Redshift – not supported
- ENUM, SET: MySQL – supported, Redshift – not supported
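A hedged sketch of what a couple of these MySQL-to-Redshift fixes might look like in a row-level transformer (mapping zero dates to NULL and lower-casing column names); a real transformer would cover many more of the cases in the list above:

```python
def transform_row_for_redshift(row):
    """Apply a few MySQL -> Redshift compatibility fixes.
    Illustrative only; real transformers handle data types, VARCHAR
    byte lengths, TIME ranges, and more."""
    out = {}
    for col, value in row.items():
        # Redshift identifiers are case-insensitive; normalize to lower case.
        col = col.lower()
        # MySQL allows the '0000-00-00' date; Redshift does not. Map to NULL.
        if value == "0000-00-00":
            value = None
        out[col] = value
    return out

print(transform_row_for_redshift({"ID": 7, "SignupDate": "0000-00-00"}))
# {'id': 7, 'signupdate': None}
```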
Step 3. Upload Change Data
If an RDBMS is your destination, loading the data can be as simple as executing a series of SQL statements. If your destination is a BI database such as Amazon Redshift, on the other hand, running plain SQL may not be feasible from a performance standpoint. Instead, you will have to use commands designed for bulk loading, such as Amazon Redshift’s COPY command. Such commands usually support only INSERTs, so you’ll have to handle UPDATEs and DELETEs separately. For example, UPDATEs to Amazon Redshift, a columnar database, can be done with a technique called UPSERT.

In addition to data changes, you’ll also need to deal with table schema changes such as adding a new column. This requires capturing DDL statements (e.g., ALTER TABLE), translating them into the destination database’s SQL, and executing them.

Error handling is also crucial for CDC-based data integration because, unlike the full export-and-import approach, failing to apply a single data change can harm the integrity of your data downstream: CDC relies on the assumption that the master and destination databases hold the same set of data. We could write a whole post on error handling, but fundamentally your solution needs to be “aware” of any errors that occur. Say an INSERT fails due to a data type error, and the very next change is an UPDATE to the record that failed to be inserted. If your solution is unaware of the initial error, the UPDATE will error out as well. So your solution needs to recognize errors immediately as they happen, buffer any subsequent change data, wait until the initial error is corrected, and then automatically resume syncing the buffered changes.
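The UPSERT technique is commonly implemented in Redshift with a staging table: load the changed rows into the staging table, delete the matching rows from the target, then insert. A sketch that generates those statements as strings; the table and key names are illustrative, and it assumes the staging table was already populated via COPY:

```python
def redshift_upsert_statements(target, staging, key):
    """Generate delete-then-insert statements for a staged UPSERT.
    Assumes `staging` was already populated via the COPY command."""
    return [
        "BEGIN;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        f"TRUNCATE {staging};",
        "COMMIT;",
    ]

stmts = redshift_upsert_statements("users", "users_staging", "id")
print(stmts[1])
# DELETE FROM users USING users_staging WHERE users.id = users_staging.id;
```

Wrapping the delete and insert in one transaction keeps readers from ever seeing the target table with the old rows removed but the new rows not yet inserted.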
Imagine doing this when you have billions of records coming into hundreds of tables (and with the network instability you could get when doing this in the cloud)! Building such mechanisms could be quite a daunting task, but once built right, it can save the time necessary for full export/imports of your databases.
As you can see, CDC-based data integration is far more complex than a full export and import of your database. However, the benefit of having all of your data ready for analysis in near real-time is enormous, and for many companies such solutions are worth the investment.
How FlyData Can Help
FlyData provides continuous, near real-time replication from RDS, MySQL, and PostgreSQL databases to Amazon Redshift. The FlyData Sync tool is an intuitive, powerful, cost-effective way to integrate your transactional databases and analytical data stores in a single interface with no manual scripting.
You can start a 14-day Free Trial and begin syncing your data within minutes. For questions about FlyData and how we can help accelerate your use-case and journey on Amazon Redshift, connect with us at firstname.lastname@example.org.