Since its introduction in 2013, Amazon Redshift has been gaining ground in the cloud-based data warehouse market. In this article, we'll compare Amazon Redshift with other data warehouse solutions such as Hive/Hadoop, BI analytics tools, Vertica, and MySQL to understand why it is getting such traction. The article also covers different approaches to replicating MySQL to Amazon Redshift, including one based on the MySQL binary log.
When it comes to BI analytics tools and data warehousing, there are so many to choose from. How do you know Amazon Redshift is the right choice for you? We’ll go over popular choices and see how Redshift stands out (or falls behind) compared to others.
MySQL
If you run a web application, you may already have a MySQL database collecting business data in real time. Why not just use it for analytics?
As you probably know already, this approach falls apart fairly quickly. Even an average analytics query (e.g. total sales from last Black Friday per demographic, compared to the previous year) can stall the MySQL database and bring down your web application entirely.
Instead of running complex analytics queries against the master MySQL database, you could run them against a MySQL replication slave, preventing them from affecting the main web application. However, this approach has its limits too.
Data accumulates as the business grows, every bit of it matters, and removing data may not be an option. The issue is that MySQL's query performance degrades significantly as the number of rows increases. MySQL is an OLTP (OnLine Transaction Processing) database, optimized for handling a massive number of small transactions, and it lacks features required for analytics queries, such as window functions.
Amazon Redshift, on the other hand, is designed to handle complex analytics queries extremely well thanks to its MPP (Massively Parallel Processing) architecture. Configured correctly, Redshift can run a query that aggregates billions of rows in seconds.
Amazon Redshift is also easier to scale than MySQL. You can increase Redshift’s disk space and computing power just by adding additional nodes. Redshift takes care of the heavy-lifting tasks such as data redistribution automatically.
It's possible to scale MySQL using techniques such as sharding, but doing so requires application-level changes.
BI Analytics Tools
A variety of BI analytics tools are available on the market: Birst, IBM, Information Builders, MicroStrategy, Oracle, Pentaho, SAP, Tableau, TIBCO, and many more. Each has its own strengths, but broadly speaking, they:
- Offer a rich feature set, providing a whole-package BI solution targeting specific business domains
- Provide many predefined and customizable reporting/visualization options
- Are built primarily for enterprise businesses
- Are expensive (in both product and IT cost)
These tools are an excellent choice if your organization has an established set of business processes and operations with well-defined business metrics, i.e., an enterprise.
Startups and small-to-midsize businesses, in contrast, tend to require ad-hoc, exploratory analytics (and, of course, their budgets are limited). For them, BI analytics tools can be overkill and too expensive.
Amazon Redshift is ideal for startups and small-to-midsize businesses. Redshift supports PostgreSQL-compatible SQL, so data engineers can run analytics queries without product-proprietary training. Amazon Redshift is also a viable choice for large enterprises thanks to its massive scalability and the flexibility of that same SQL syntax.
Some BI analytics tools (IBM, MicroStrategy, Pentaho, Tableau, TIBCO, etc.) work with Amazon Redshift as a data source, and companies often use them as a frontend to Redshift.
Hive/Hadoop or Spark
Apache Hadoop has been a favorite technology of the Big Data era for quite some time. Apache Spark has quickly gained popularity as a replacement for, or a complement to, Hadoop. Both Hadoop and Spark are, in essence, distributed computing frameworks. Both can handle data warehouse tasks similar to Amazon Redshift, but that is only one of their many applications. Hive is a SQL interface on top of Hadoop that lets a user access a Hadoop cluster with a SQL-like language, which makes Hive/Hadoop the more direct comparison to Amazon Redshift. In terms of performance, Redshift has a clear edge over Hive/Hadoop (see Hive and Amazon Redshift: A Brief Comparison): depending on conditions, Redshift is reported to be 6-20 times faster than a comparable Hive setup. The performance gap got smaller with Spark and its SQL engines such as Shark.
But according to a benchmark by AMPLab at U.C. Berkeley, Redshift is still twice as fast as in-memory Shark/Spark. Hadoop and Spark are offered as framework libraries: to run them, the framework needs to be deployed on servers, which in turn need to be deployed in a data center. Amazon Redshift again has a clear edge here because it is a fully managed service. A number of Hadoop-as-a-service solutions are available from various players, including Amazon's own Elastic MapReduce (EMR). Using these services reduces the personnel cost significantly. However, the learning curve to master Hadoop or Spark remains, which should be considered a cost of their general-purpose nature.
Amazon Redshift is a data warehouse, and is designed to be exactly that. This focus explains why Amazon Redshift has such a distinctive price/performance advantage over Hadoop/Spark.
Vertica, Greenplum, Netezza, Teradata
These are all popular data warehouse products with which Amazon Redshift competes more head-to-head than the other alternatives we have seen so far. All of these products offer an on-premise solution. If your data sits on-premise and cannot be moved out, for security or bandwidth reasons, Redshift may not be your solution, as it is a cloud-only service. These data warehouse products offer more analytics functions than Redshift, as well as finer workload management and more concurrent queries, though Redshift has been catching up in these areas.
For example, Redshift increased its number of concurrent queries from 15 to 50 in April 2014. If hundreds of data analysts run concurrent queries on a single data warehouse, a traditional data warehouse product, especially Teradata, may be a better choice. Performance-wise, Redshift beats Vertica on a comparable hardware setup, according to a benchmark by CAKE & Accelerize. Although I was unable to obtain a reliable third-party benchmark between Redshift and Teradata, Teradata appears to perform better, especially with massive amounts of data.
Cost is where Redshift beats the other products hands down. Redshift costs as low as $935/TB/year (with a 3-year contract), while the others cost significantly more (Vertica $10,000/TB, Greenplum $30,000/TB, Teradata $66,000/TB). Such a cost difference is hard to justify unless you have a very good reason to choose one of the alternatives.
How do you pipe data into Amazon Redshift?
Redshift, or any data warehouse, is useless without up-to-date data, so it is crucial to plan and implement a data integration strategy that fits your needs. This section goes over the different strategies to help you choose the best one.
Export & Import
The simplest way to copy data from MySQL to Redshift is to export the entire dataset from MySQL and import it into Redshift. You can repeat the same steps every time you want the latest data from MySQL. Export & Import consists of three steps: export, transformation, and import.
You can export data using MySQL’s mysqldump command. A typical mysqldump command looks like this:
$ mysqldump -h yourmysqlhost -u user mydatabase mytable1 mytable2 --result-file dump.sql
The output is a set of MySQL SQL statements. You cannot run this SQL on Amazon Redshift as is, so you'll have to transform the contents into a format suitable for Redshift import.
Being optimized as a data warehouse, Redshift's data manipulation performance (INSERT, UPDATE, DELETE) is suboptimal. To load data efficiently, Redshift offers the COPY command, which batch-uploads files in TSV (tab-separated values) format stored in Amazon S3 into a Redshift table. A row of data in a MySQL dump looks like this:
mysql> INSERT INTO `users` (`id`, `firstname`, `lastname`, `age`) VALUES (1923, 'John', 'Smith', 34), (1925, 'Tommy', 'King', 28);
This needs to be converted to TSV as follows:
1923	John	Smith	34
Note that the values are separated by a tab character (\t).
Beyond the format conversion, you may also have to convert values to be Redshift-compatible, because MySQL and Redshift support different column types. Even where they support the same data types, acceptable values may differ. For example, '0000-00-00' is a valid DATE value in MySQL, but Redshift will throw an error when that value is uploaded. You have to convert it to something Redshift accepts, '0001-01-01' for example. This article shows how FlyData handles such differences.
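The transformation step can be sketched as a small script. This is a minimal illustration, not a complete type mapping: the column set, the zero-date fix-up, and the escaping rules are assumptions chosen to match the TSV framing described above.

```python
import io

ZERO_DATE = "0000-00-00"          # valid DATE in MySQL, rejected by Redshift
REDSHIFT_MIN_DATE = "0001-01-01"  # a value Redshift accepts instead

def to_tsv_value(value):
    """Render one MySQL value as a Redshift-compatible TSV field."""
    if value is None:
        return ""                 # empty field, to be treated as NULL by COPY
    if value == ZERO_DATE:
        return REDSHIFT_MIN_DATE  # map MySQL's zero date to an accepted date
    s = str(value)
    # Escape characters that would break the tab/newline framing of TSV.
    return s.replace("\\", "\\\\").replace("\t", "\\\t").replace("\n", "\\\n")

def rows_to_tsv(rows):
    """Serialize a list of row tuples into TSV text for the COPY command."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(to_tsv_value(v) for v in row) + "\n")
    return buf.getvalue()

rows = [(1923, "John", "Smith", "1982-03-04"),
        (1925, "Tommy", "King", ZERO_DATE)]
print(rows_to_tsv(rows))
```

In a real pipeline, the rows would come from parsing the mysqldump output or from a direct SELECT; the escaping here assumes the COPY command is run with its ESCAPE option.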
Once you have the data in TSV files, the last step is to import it into Redshift:
- Upload the TSV files to Amazon S3
- Run the COPY command
The COPY command looks like this:
COPY users FROM 's3://my_s3_bucket/unload-folder/users_' credentials 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key';
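The upload-then-COPY step can be sketched in a few lines, assuming the boto3 and psycopg2 libraries; the bucket name, cluster endpoint, and credentials are placeholders, not real values.

```python
def build_copy_sql(table, s3_prefix, access_key, secret_key):
    """Build a Redshift COPY statement for TSV files under an S3 prefix."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"credentials 'aws_access_key_id={access_key};aws_secret_access_key={secret_key}' "
        f"delimiter '\\t' null as '' escape"  # matches the TSV framing above
    )

def upload_and_copy(tsv_path, bucket, key, table, access_key, secret_key):
    import boto3
    import psycopg2

    # 1. Upload the TSV file to S3.
    boto3.client("s3").upload_file(tsv_path, bucket, key)

    # 2. Run COPY on the Redshift cluster (connection params are placeholders).
    with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                          port=5439, dbname="mydb",
                          user="admin", password="...") as conn:
        with conn.cursor() as cur:
            cur.execute(build_copy_sql(
                table, f"s3://{bucket}/{key}", access_key, secret_key))

print(build_copy_sql("users", "s3://my_s3_bucket/unload-folder/users_",
                     "YOUR_KEY", "YOUR_SECRET"))
```

Keeping the statement builder separate from the network calls makes it easy to test the generated SQL without touching AWS.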
Although Export & Import is the simplest way to copy data from MySQL to Redshift, it limits how often you can update. For example, 18GB of data takes about 30 minutes to export from MySQL over a 100Mbps network, and another 30 minutes to import into Redshift, so you can refresh the data only as often as every hour. In real life, the network can disconnect during export or import, in which case you need to restart the process from the beginning, delaying it further. If you want a shorter data refresh cycle or more reliable updates, you need another approach.
Incremental SELECT & COPY
If your MySQL dataset is too big for Export & Import, the next easiest approach is incremental SELECT & COPY. Instead of exporting and importing the entire dataset every time, it updates only the records changed since the last update. This reduces the number of records to process significantly, so you can update the data more frequently. To implement incremental SELECT & COPY, your MySQL table has to meet a couple of conditions:
- The table must have an updated_at column, whose timestamp is updated every time the row changes
- The table must have a unique key (or keys)
Unlike Export & Import, incremental SELECT exports only the rows changed since the last update. The SELECT query you run on MySQL looks like this:
SELECT * FROM users WHERE updated_at >= '2016-08-12 20:00:00';
Save the result to a file for transformation.
The transformation step is the same as in Export & Import: transform the MySQL data into TSV format for the Redshift COPY command.
The data from MySQL includes both rows newly inserted and rows updated since the last snapshot. You cannot run a COPY command straight into the destination Redshift table because the updated rows would be duplicated. To avoid duplicates, use the following steps, a technique we call DELSERT (DELete + inSERT):
- Create a temporary table on Redshift with the same definition as the destination table
- Run the COPY command to upload the data to the temporary table
- Delete the rows from the destination table that also exist in the temporary table:
DELETE FROM users USING users_staging s WHERE users.id = s.id;
Here "id" is the unique key of the table. Finally, insert the rows from the temporary table into the destination table:
INSERT INTO users (id, firstname, lastname, updated_at) SELECT id, firstname, lastname, updated_at FROM users_staging s;
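The DELSERT steps above can be sketched as a generated statement sequence. The staging-table naming, the S3 prefix, and the helper function itself are illustrative, and the credentials string is left as a placeholder.

```python
def delsert_statements(table, columns, unique_key):
    """Generate the DELSERT SQL sequence for one table."""
    staging = f"{table}_staging"
    cols = ", ".join(columns)
    return [
        # 1. Temporary table with the same definition as the destination.
        f"CREATE TEMP TABLE {staging} (LIKE {table});",
        # 2. Batch-upload the changed rows into the staging table.
        f"COPY {staging} FROM 's3://my_s3_bucket/unload-folder/{table}_' "
        "credentials '...';",
        # 3. Drop destination rows that were updated (they exist in staging).
        f"DELETE FROM {table} USING {staging} s "
        f"WHERE {table}.{unique_key} = s.{unique_key};",
        # 4. Re-insert the new and updated rows.
        f"INSERT INTO {table} ({cols}) SELECT {cols} FROM {staging};",
    ]

for sql in delsert_statements("users",
                              ["id", "firstname", "lastname", "updated_at"],
                              "id"):
    print(sql)
```

Running the four statements inside one transaction keeps readers from ever seeing the table with the updated rows deleted but not yet re-inserted.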
This approach is more efficient than Export & Import, but it has a few drawbacks. The first is that rows deleted from MySQL stay on Redshift indefinitely. This may actually be preferred behavior if you want to keep historical data on Redshift while purging old data from MySQL for performance and cost reasons.
The second drawback is that it does not handle table schema changes well: when a column is added to or dropped from the MySQL table, you need to make the corresponding change on the Redshift table manually.
Another drawback is that the query fetching updated rows can affect the performance of the MySQL database. If any of these drawbacks is a deal breaker, consider Change Data Capture (CDC).
Change Data Capture with Binlog
Change Data Capture (CDC) is a technique that captures changes made to the source (MySQL) table and applies them to the destination (Redshift). MySQL enables CDC through the Binary Log (binlog), a binary log of changes. It was originally developed to implement replication, but it also serves as a CDC outlet for replicating data to a different kind of database.
The binlog captures not only data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. By replaying those schema changes on Amazon Redshift, you can achieve true replication between MySQL and Redshift. And unlike incremental SELECT & COPY, rows deleted from MySQL can be deleted from Redshift as well.
The binlog also lets you capture change data as a stream, making near-real-time replication possible. Moreover, reading the binlog does not load the MySQL database the way SELECT queries do.
What is Binlog?
As mentioned earlier, the binlog is used to implement MySQL replication. When the binlog is enabled, the MySQL server logs change events into a local file, called a binlog file. The MySQL server keeps multiple binlog files as data grows.
You can access binlog data via the MySQL protocol. To replicate MySQL data to Amazon Redshift using the binlog, you write an application that reads the streaming binlog data from the MySQL server, transforms it, and imports it into Redshift.
How to set up data replication using Binlog
There is an open-source library called mysql-replication-listener. This C++ library offers a streaming API to read data from the MySQL binlog in real time. Higher-level APIs are also available for a couple of languages.
To use the binlog for MySQL replication to Amazon Redshift, you need to set the MySQL config parameters to enable it. The following is a list of binlog-related parameters (the values shown are examples):
log_bin = /file_path/mysql-bin.log
expire_logs_days = 4
binlog_format = ROW
replicate-wild-do-table = mydatabase.%
log-slave-updates = TRUE
The binlog_format parameter sets how binlog events are stored in the binlog files. There are three supported formats: STATEMENT, MIXED, and ROW.
STATEMENT format saves queries in the binlog files as-is (e.g. UPDATE users SET firstname='Tom' WHERE id=293;). Although it keeps binlog files small, it has issues when used for replication. For replication to Redshift, use ROW format.
ROW format saves the changed values themselves in the binlog files. It increases binlog file size but ensures data consistency between MySQL and Amazon Redshift. log_bin sets the path where binlog files are stored, and expire_logs_days determines how many days binlog files are kept.
It's recommended to keep binlog files for a couple of days, which gives you time to address an issue and resume replication if it stops due to an error. Specify the tables you'd like to replicate in the replicate-wild-do-table parameter; only the tables specified there go into the binlog files. If you use a MySQL replication slave server as the source, it's important to set log-slave-updates to TRUE, otherwise data changes made on the replication master will not be logged in the binlog. Finally, the MySQL account used to access the binlog needs the following privileges to perform replication-related tasks:
- REPLICATION SLAVE
- REPLICATION CLIENT
- LOCK TABLES
Export & Transformation
With the binlog, the export is real-time data streaming from the MySQL binlog files. Taking kodama as an example, binlog data is delivered as a stream of binlog events. Kodama lets you register event handlers for different event types (insert, update, delete, alter table, create table, etc.). Your application receives binlog events and generates output ready either for Redshift import (for data changes) or for schema changes (for table schema changes). The former is similar to the transformation step of the other data integration approaches. Unlike the others, however, the binlog delivers delete events, so you need to handle them specifically. This blog article (How to Improve Amazon Redshift Upload Performance) describes how to handle delete events.
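The event-handler dispatch described above can be sketched as follows. The dict-shaped events and the three output channels are illustrative, not kodama's actual API; in a real application the events would come from a binlog client library.

```python
def handle_event(event, tsv_lines, delete_keys, ddl_statements):
    """Route one binlog event to the proper output channel."""
    kind = event["type"]
    if kind in ("insert", "update"):
        # Inserts and updates both become rows for the batch upload.
        tsv_lines.append("\t".join(str(v) for v in event["row"]))
    elif kind == "delete":
        # Deletes are collected separately; they need their own DELETE pass.
        delete_keys.append(event["row"][0])  # assume first column is the key
    elif kind in ("alter_table", "create_table"):
        # Schema changes must be converted to Redshift DDL before replay.
        ddl_statements.append(event["sql"])

tsv, deletes, ddl = [], [], []
for ev in [{"type": "insert", "row": (1923, "John", "Smith")},
           {"type": "delete", "row": (1925,)},
           {"type": "alter_table",
            "sql": "ALTER TABLE users ADD COLUMN age INT"}]:
    handle_event(ev, tsv, deletes, ddl)
print(tsv, deletes, ddl)
```

Separating the three channels mirrors the fact that data changes, deletes, and schema changes each take a different path into Redshift.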
Amazon Redshift does not provide streaming upload functionality, so for efficient data upload we have to take the same batch-upload approach used in the other data integration approaches.
Although replication using the binlog is the ideal way to replicate data between MySQL and Amazon Redshift, it requires serious application development. Beyond the main data processing flow described above, your application has to deal with the nitty-gritty details of being production-ready:
- Transaction management. The network, or sometimes even the MySQL process, can go down while your application is reading binlog data. Your application needs to be able to resume from the binlog position where the problem occurred.
- Data buffering and retry. Similarly, Redshift can become unavailable while your application is sending data. Your application needs to buffer unsent data until the Redshift cluster comes back. Oftentimes, seemingly straightforward questions, such as whether a chunk of data has been uploaded to Redshift or not, can be difficult to answer, and a wrong decision causes either data loss or duplicate data.
- Table schema change support. A table schema change binlog event (ALTER TABLE, ADD/DROP COLUMN, etc.) arrives as a native MySQL SQL statement that does not run on Redshift as-is. To support table schema changes, you have to convert such statements into the corresponding Amazon Redshift statements.
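A minimal sketch of that DDL conversion is shown below. The type map covers only a few common cases and is an assumption, not a complete MySQL-to-Redshift mapping.

```python
import re

MYSQL_TO_REDSHIFT_TYPES = {
    "TINYINT": "SMALLINT",        # Redshift has no TINYINT
    "DATETIME": "TIMESTAMP",      # Redshift uses TIMESTAMP
    "TEXT": "VARCHAR(65535)",     # Redshift has no unbounded TEXT
    "DOUBLE": "DOUBLE PRECISION",
}

def convert_ddl(mysql_ddl):
    """Rewrite MySQL column types in a DDL statement for Redshift."""
    out = mysql_ddl
    for mysql_type, redshift_type in MYSQL_TO_REDSHIFT_TYPES.items():
        # Whole-word, case-insensitive replacement of each type name.
        out = re.sub(rf"\b{mysql_type}\b", redshift_type, out,
                     flags=re.IGNORECASE)
    return out

print(convert_ddl("ALTER TABLE users ADD COLUMN last_seen DATETIME"))
# -> ALTER TABLE users ADD COLUMN last_seen TIMESTAMP
```

A production converter would also have to handle defaults, NOT NULL clauses, and statements Redshift does not support at all, but the shape of the problem is the same.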
Amazon Redshift brought game-changing price/performance to the data warehouse market. To fully utilize its potential, replicating data between MySQL and Redshift as close to real time as possible is necessary. There are a few data integration strategies, each with its own benefits and drawbacks. The binlog approach is the ideal solution, but its development and operation cost may be prohibitive. Unless data integration is the core of your business, it may make business sense to use a data integration service like FlyData and free up your precious engineering resources for core BI development.