Amazon Redshift was birthed out of PostgreSQL 8.0.2. For this reason, many analysts and engineers making the move from Postgres to Redshift feel a certain comfort and familiarity about the transition. While it is true that much of the syntax and functionality crosses over, there are key differences in syntactic structure, performance, and the mechanics under the hood. In this guide, we explore those differences so that you as a data developer don’t fall for the most common pitfalls in assuming Redshift is just “Amazon’s Postgres 2.0.”
Rows vs. Columns in Postgres and Redshift
Among the key differences between Postgres and Redshift are that Postgres is a row-store database while Redshift is column-stored. This drastically affects query performance for basic SELECT statements. Think of it as an inverse relationship or trade-off between number of columns and number of rows you can query in one shot, for a given query fetch time.
In Postgres, rows are the fundamental data object. This means that tables are comprised of entities called rows, which must then be further parsed to extract columnar data. For this reason, it is simple to query many rows in Postgres, regardless of the number of columns each row contains. You could have 50 columns fetched from your 300k row result, and it would take negligibly more time to fetch than 20 rows. So, if you want a ton of columns and only a small sample of data to explore with (say, only 50k rows), you get all the columns basically for free!
There’s a specific drawback to this methodology when dealing with “Big Data”. You may have heard of the 4 Vs of big data: Volume, Velocity, Veracity, and Variety. In this case, we’re talking specifically about volume (row count) and variety (column count… though you could construe column count as another dimension of volume, but that’s getting more philosophical than is strictly useful here). What happens if you have a massive table (or query resulting in a massive table result), with say, 100 columns and 10 million rows, but you only care about 5 columns? With Postgres, you have no control over the fact that you must fetch all 100 columns for all 10 million rows, because columns are a substructure of each row. This results in unnecessarily slow queries for data developers, especially during exploratory or optimization-discovery phases.
This is where Redshift saves the day. In Redshift, columns are the fundamental objects. Each column is literally its own file, mapped to a given table, and you must parse rows out of it. This makes quick work out of querying millions of rows at a time when you’re only concerned with a handful of columns, which is great for modestly refined business processes where you know exactly what you need and you need to process your massive-number-of-rows dataset into actionable insights. Of course, where this mechanism falls short is exploring many columns at once. What’s worse, most practitioners new to Redshift know that it’s a Big Data platform and assume they can query all these columns and all these rows, no sweat. Unfortunately, I’ve seen Redshift queries scale horribly with column counts when querying wide (many-column) tables.
Redshift vs. Postgres: Table Constraints
Due to the row vs column store nature of these databases, indexes and constraints need to be implemented completely differently. Technically, there’s no such thing as a proper “index” on Redshift – a feature commonly supported by any standard row-store database like Postgres, MySQL, SQLite, etc. This means that there isn’t even the real concept of a primary key in Redshift! Of course, since each column is its own file, looking up row number X is trivial – so it’s not like you really need a primary key for indexing purposes on a single column. However, it’s important to use Redshift’s SORT KEY and DIST KEY for optimizing common queries against tables, and even performing cross-table lookups.
But what about foreign key constraints? Not only is this useful for fast lookups between tables with N:N relationships, but it's necessary for enforcing good data modeling. Things like this prevent you from adding rows to one table that have no match in another, usually highly business related table. Unfortunately, a drawback of Redshift is that foreign key constraints are not respected. In Redshift, you can define foreign key constraints, but they are little more than author’s notes on the data and are NOT ENFORCED when inserting or deleting rows of data. There are two other table constraints that are “permitted” but informational only (i.e.: not respected by the database engine): primary keys and unique constraints. There is one key advantage of applying these informational-only constraints, however – the query planner utilizes them. Check constraints and exclusion constraints aren’t allowed at all in Redshift, either.
Postgres vs. Redshift: Data Types and Other Features
Redshift does not support the following data types, which are supported by Postgres, according to Amazon’s own documentation:
- BIT, BIT VARYING
- Composite Types
- Date/Time Types
- Enumerated Types
- Geometric Types
- Network Address Types
- Numeric Types
- SERIAL, BIGSERIAL, SMALLSERIAL
- Object Identifier Types
- Range Types
- Text Search Types
A quick note: the TIME data type is not the same as the TIMESTAMP datatype – Redshift still supports date and time stamp data!
Check out some more of the AWS docs for more unsupported Postgres features, but conspicuous among them are Tablespaces, database roles (“users” take the same effective place, and can be assigned to “groups”), stored procedures, table functions, NULLS in window functions, triggers, or collation functions (ORDER BY clauses return results based on binary UTF-8 ordering, ignoring locale-specific characters).
There are also many unsupported Postgres functions you won’t have handy in Redshift, some of which include string_agg (replaced by LISTAGG), several covariance and regression functions, array functions (since arrays are not supported), range functions and operators (since ranges aren’t supported), several string functions, including REGEXP_MATCHES(), trigger functions (since triggers aren’t supported), and XML functions (since XML isn’t supported).
At FlyData, we use the following data-type mappings to move data between Postgres and Redshift quickly and easily. Don’t let all these lack-ofs dissuade you from Redshift, however. Redshift is simply built with data warehousing and fast analytics in mind, not intensely complex and nuanced data processing. Ultimately, Redshift is a far more scalable solution than Postgres.
Redshift vs. Postgres: The Power of Distributed Data
The row-store vs column-store nature of Postgres and Redshift, respectively, is only half the story, though. To manage millions or billions of rows of data, Redshift operates on a cluster with a single leader node and a user-selected number of worker nodes. The leader node is responsible for managing how data and workloads are distributed across the several worker nodes. Generally, more workers means more horsepower and faster queries, but this comes at a cost. Literally, it just costs more money. Of course, spinning up just 2-4 worker nodes should make for a reasonable price vs performance balance. That being said, different companies have different needs, so check out our guide to Redshift pricing if your current cluster setup isn’t quite at your sweet spot. Regardless, it should be a pretty intuitive result that clusters outperform single-node Postgres databases for equivalent under-the-hood operations, and this is a strong plus for migrating over to Redshift.
What Redshift lacks in features, it makes up for in scalability and ability to processes large volumes of data. If your data is not “big” (remember your 4 V’s!), consider sticking with Postgres or another row-store database a while longer… but if you’re keeping scalability in mind, it’s always better to migrate to Redshift earlier rather than later.