This article explains what Amazon Redshift is and what it is not. We hope it helps anyone who is curious about Amazon Redshift.

Redshift is a Disruptive Price Alternative in DWH

A data warehouse is a system for analyzing and managing data: a database for big data where data is mostly appended rather than updated. The difference between an ordinary database system like MySQL and a data warehouse is that a data warehouse can store a huge volume of past or archived data and is good at online analytical processing (OLAP) such as data aggregation. A data warehouse is not designed to perform online transaction processing (OLTP) in milliseconds, nor is it ideal for handling frequent updates. Data warehousing is used with Business Intelligence (BI) tools that help with enterprise decision-making, and for optimizing systems based on historical data.

Data warehousing has been considered a costly venture, ranging anywhere from hundreds of thousands of dollars to a few million dollars per year. The price is often high because data warehousing is used for a variety of purposes and requires specialized hardware. However, because of recent advances in low-cost big data processing technology such as Hadoop, smaller organizations such as advertising and social gaming companies are using big data analysis to create new business.

Big data analysis and data warehousing often go hand in hand but are also seen as two separate services because of major price differences. Redshift has made it possible for even start-ups and other small-scale businesses to make use of both without any major increase in price. This is why it is being called disruptive. Although the same kind of processing can be done on Hadoop Hive at a comparatively low cost, there are still substantial costs when it comes to securing a Hadoop specialist or managing a large-scale server setup. So with Hadoop it remains very hard for small businesses to enjoy the benefits of big data analysis with any degree of ease or sufficiency.

Specific Points

So what exactly are the differences between Redshift and existing data warehouse tools on Hadoop?

Price

Even for terabyte-scale data warehousing, price tags run into the millions of dollars, and multiple DBAs have to be brought on board to run the system. Amazon provides the same level of service for as low as US$1,000 a year per terabyte (three-year reserved instance, minimum 2TB). Database management takes place in the Amazon cloud and is included in the cost. Plainly put, it’s as simple as comparing $1 million to $1,000 (that’s three whole zeroes!). I think it would be hard to find a rationale that explains this gap in price, and that is why there has been so much noise about Redshift.

At FlyData, we ran benchmarks for the same queries on Hadoop (using Hive and the same SQL query), and the results we published show that Redshift is more than 10x better in both speed and cost, even when comparing only the normal execution cost. At one point, our benchmark was the Hottest Slide on SlideShare, which shows the high level of interest that Amazon Redshift has sparked among big data specialists.

[Figures: Redshift Hadoop1, Redshift Hadoop2]

Features as Data Warehouse

First and foremost, the greatest feature of Redshift is its columnar storage technology. As its name suggests, a column-oriented data structure stores data by columns, as opposed to traditional row-oriented databases. When indexes are used, traditional databases can be great at retrieving data from a specific row in an instant, but they are not as good at, for instance, running aggregate functions over all records, because data from columns the query does not need also has to be processed. Columnar databases are ideally suited for aggregate processing where only a few specific columns need to be worked on. Plus, compression efficiency is much greater because it is common for the same values to be repeated throughout an individual column.
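
The difference between the two layouts, and why repeated values in a column compress well, can be shown with a small sketch. This is only a toy illustration and not how Redshift is implemented: the sample table, the function names, and the choice of run-length encoding as the compression scheme are all assumptions made for the example.

```python
from itertools import groupby

# Hypothetical sample table: each row is (user_id, country, purchase_amount).
rows = [
    (1, "US", 30),
    (2, "US", 20),
    (3, "US", 50),
    (4, "JP", 10),
    (5, "JP", 40),
]

# Row-oriented layout keeps the values of one record together (as in `rows`).
# Column-oriented layout keeps the values of one column together.
columns = {
    "user_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def total_row_store(table):
    """Sum one field; every full record is visited to reach that field."""
    return sum(record[2] for record in table)


def total_column_store(cols):
    """Sum one field; only that column's values are read."""
    return sum(cols["amount"])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(total_row_store(rows), total_column_store(columns))  # 150 150
print(run_length_encode(columns["country"]))  # [('US', 3), ('JP', 2)]
```

Both layouts give the same total, but the columnar version touches one list instead of every record, and the five country values shrink to two pairs because equal values sit next to each other.
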
Redshift chooses from among seven different compression algorithms (as of February 28) for each column. It also gives you ways to detect the optimal compression method for data that is going to be loaded or for data that is already stored. Even with big data, results are obtained with remarkable speed because only the compressed columns necessary for a query are retrieved. Hadoop has supplemental modules that exploit the merits of columnar orientation (RCFile, HBase, etc.) but they present some technical barriers. In my experience, these modules are not often used and require more in-depth knowledge of Hadoop.

But the truly stand-out feature of Redshift when compared with other data warehouses is its scalability, as might be expected from Amazon. With Massively Parallel Processing (MPP), as data volume grows, processing and storage nodes can be added in order to preserve or improve processing speed. We verified this at FlyData by running benchmarks for data loading and query speed.

[Figures: Redshift1, Redshift2]

Redshift actually grew out of technology originally developed by ParAccel, in which Amazon has since invested. MPP is still offered by ParAccel and other vendors, but it is significant that MPP is now available from AWS. The greatest feature unique to Amazon is that scale-out (increasing the number of nodes) can be accomplished in a matter of clicks! I think this is going to be a unique part of this public cloud-based service and a very hard act for any other vendor to follow.

In summary, Redshift is a columnar database with good compression options and flexible scalability via MPP, with scale-out in just a few clicks. When I first heard about these features, and then again after we successfully verified them, I actually got goosebumps.

Compatibility

Redshift is built on PostgreSQL database technology. That means almost all PostgreSQL drivers, JDBC drivers, and compatible tools can be used.
In short, once you load data onto Redshift, you can easily use it from existing applications and newly created web services! Of course, this doesn’t mean all PostgreSQL functions can be used (for example, only primitive data types are supported), but we can assume that the barrier to entry is significantly lower than Hadoop’s.

Current Issues

The biggest point of concern is that Redshift has only just been opened to the public, so there is not much information available about its performance yet. But this technology has been used and tested from the start by the world’s number one e-commerce site, namely Amazon, and it was also tested by a variety of enterprises well before the wide release. There are bound to be a number of issues after release as it comes into wide use, but these should settle down fairly quickly. In the benchmarks referred to earlier, we found that it took a long time to load big batches of data onto Redshift, but we also found that increasing the number of nodes gave a significant increase in performance. FlyData is also working to develop a continuous data upload mechanism to avoid delays. Through articles like this one, FlyData aims to support anyone who chooses to use Redshift, as its technical details and the ways it can be used are not widely known yet.
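
The scale-out idea described above, where adding nodes keeps the work per node small as data grows, can be illustrated with a toy sketch. It is not Redshift's actual mechanism: the round-robin distribution, the thread pool standing in for nodes, and all function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(values, node_count):
    """Assign values to nodes round-robin (a stand-in for a real distribution scheme)."""
    slices = [[] for _ in range(node_count)]
    for index, value in enumerate(values):
        slices[index % node_count].append(value)
    return slices


def node_partial_sum(node_slice):
    """Work done independently on one node: aggregate only its own slice."""
    return sum(node_slice)


def parallel_total(values, node_count):
    """Return the combined total and the number of values each node handled."""
    slices = distribute(values, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_partial_sum, slices))
    return sum(partials), [len(s) for s in slices]


amounts = list(range(1, 1001))  # hypothetical column of 1,000 values
for nodes in (2, 4, 8):
    total, per_node = parallel_total(amounts, nodes)
    print(nodes, total, max(per_node))
# 2 500500 500
# 4 500500 250
# 8 500500 125
```

The total stays the same however many nodes are used, while the largest slice any single node has to process shrinks as nodes are added, which is the property that lets an MPP system preserve speed as data volume grows.
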

Segregation with Hadoop

Naturally, there are many areas in which Hadoop excels over Redshift. There are examples of advanced processing that can only be done on Hadoop such as processing the whole database record by record and analytics that use complex machine learning. Hadoop may be superior in terms of cost performance for processing data that is not analyzed frequently (annual or monthly petabyte scale batch processing). On the other hand, for processing data close to real time with a short turnaround on constantly updated data (minutes old), Redshift may show big advantages over Hadoop and other data warehousing systems. Such applications are common in advertising technology, digital marketing, and social gaming analysis to name a few.

Redshift is the Killer App for Big Data Analysis

Until now, we at FlyData have proceeded under the belief that the shortest path to bringing the benefits of big data to companies of any size lies in figuring out how to make Hadoop cheaper to use. We believed that Hadoop was the only possibility for the general use of big data. This viewpoint has been completely overturned by Redshift. This is a most auspicious thing. Why? Because it means that big data analytics is now open to companies of any size, in a form that is easy to use and at a low price point. At FlyData, we want to help make Redshift easy to use so everyone can leverage big data in the cloud to open up the path to a better world.


About FlyData Team:
FlyData syncs your data to Amazon Redshift. We provide intelligent data integration software that handles all of the back-end data plumbing, letting you focus on the data analysis instead of all of the setup work.

FlyData handles real-time replication for Amazon RDS and Aurora, MySQL and PostgreSQL.

